PREFACE TO THE 
THIRD EDITION 


The purpose of the third edition of this book is to give a sound and self-contained
(in the sense that the necessary probability theory is included) introduction
to classical or mainstream statistical theory. It is not a
statistical-methods-cookbook, nor a compendium of statistical theories, nor is it a mathematics
book. The book is intended to be a textbook, aimed for use in the traditional 
full-year upper-division undergraduate course in probability and statistics, 
or for use as a text in a course designed for first-year graduate students. The 
latter course is often a “service course,” offered to a variety of disciplines. 

No previous course in probability or statistics is needed in order to study 
the book. The mathematical preparation required is the conventional full-year 
calculus course which includes series expansion, multiple integration, and partial
differentiation. Linear algebra is not required. An attempt has been
made to talk to the reader. Also, we have retained the approach of presenting
the theory with some connection to practical problems. The book is not
mathematically rigorous. Proofs, and even exact statements of results, are often not
given. Instead, we have tried to impart a “feel” for the theory.

The book is designed to be used in either the quarter system or the semester 
system. In a quarter system, Chaps. I through V could be covered in the first 
quarter, Chaps. VI through part of VIII the second quarter, and the rest of the
book the third quarter. In a semester system, Chaps. I through VI could be 
covered the first semester and the remaining chapters the second semester. 
Chapter VI is a “bridging” chapter; it can be considered to be a part of
“probability” or a part of “statistics.” Several sections or subsections can be omitted
without disrupting the continuity of presentation. For example, any of the
following could be omitted: Subsec. 4.5 of Chap. II; Subsecs. 2.6, 3.5, 4.2, and
4.3 of Chap. III; Subsec. 5.3 of Chap. VI; Subsecs. 2.3, 3.4, 4.3 and Secs. 6
through 9 of Chap. VII; Secs. 5 and 6 of Chap. VIII; Secs. 6 and 7 of Chap. IX;
and all or part of Chaps. X and XI. Subsection 5.3 of Chap. VI on extreme-value
theory is somewhat more difficult than the rest of that chapter. In Chap. VII, 
Subsec. 7.1 on Bayes estimation can be taught without Subsec. 3.4 on loss and 
risk functions but Subsec. 7.2 cannot. Parts of Sec. 8 of Chap. VII utilize matrix 
notation. The many problems are intended to be essential for learning the 
material in the book. Some of the more difficult problems have been starred. 


ALEXANDER M. MOOD 
FRANKLIN A. GRAYBILL 
DUANE C. BOES 



EXCERPTS FROM THE FIRST 
AND SECOND EDITION PREFACES 


This book developed from a set of notes which I prepared in 1945. At that time 
there was no modern text available specifically designed for beginning students 
of mathematical statistics. Since then the situation has been relieved considerably,
and had I known in advance what books were in the making it is likely
that I should not have embarked on this volume. However, it seemed sufficiently
different from other presentations to give prospective teachers and students
a useful alternative choice.

The aforementioned notes were used as text material for three years at Iowa 
State College in a course offered to senior and first-year graduate students. 
The only prerequisite for the course was one year of calculus, and this requirement
indicates the level of the book. (The calculus class at Iowa State met four
hours per week and included good coverage of Taylor series, partial differentiation,
and multiple integration.) No previous knowledge of statistics is assumed.

This is a statistics book, not a mathematics book, as any mathematician 
will readily see. Little mathematical rigor is to be found in the derivations 
simply because it would be boring and largely a waste of time at this level. Of 
course rigorous thinking is quite essential to good statistics, and I have been at 
some pains to make a show of rigor and to instill an appreciation for rigor by 
pointing out various pitfalls of loose arguments. 
While this text is primarily concerned with the theory of statistics, full 
cognizance has been taken of those students who fear that a moment may be 
wasted in mathematical frivolity. All new subjects are supplied with a little 
scenery from practical affairs, and, more important, a serious effort has been 
made in the problems to illustrate the variety of ways in which the theory may 
be applied. 

The problems are an essential part of the book. They range from simple 
numerical examples to theorems needed in subsequent chapters. They include 
important subjects which could easily take precedence over material in the text; 
the relegation of subjects to problems was based rather on the feasibility of such 
a procedure than on the priority of the subject. For example, the matter of 
correlation is dealt with almost entirely in the problems. It seemed to me
inefficient to cover multivariate situations twice in detail, i.e., with the regression
model and with the correlation model. The emphasis in the text proper is on 
the more general regression model. 

The author of a textbook is indebted to practically everyone who has 
touched the field, and I here bow to all statisticians. However, in giving credit 
to contributors one must draw the line somewhere, and I have simplified matters 
by drawing it very high; only the most eminent contributors are mentioned in 
the book. 

I am indebted to Catherine Thompson and Maxine Merrington, and to 
E. S. Pearson, editor of Biometrika, for permission to include Tables III and V, 
which are abridged versions of tables published in Biometrika. I am also
indebted to Professors R. A. Fisher and Frank Yates, and to Messrs. Oliver and
Boyd, Ltd., Edinburgh, for permission to reprint Table IV from their book 
“Statistical Tables for Use in Biological, Agricultural and Medical Research.” 

Since the first edition of this book was published in 1950 many new statistical
techniques have been made available and many techniques that were only in
the domain of the mathematical statistician are now useful and demanded by
the applied statistician. To include some of this material we have had to eliminate
other material, else the book would have come to resemble a compendium.
The general approach of presenting the theory with some connection to practical
problems apparently contributed significantly to the success of the first
edition and we have tried to maintain that feature in the present edition.
PROBABILITY 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to define probability and discuss some of its properties.
Section 2 is a brief essay on some of the different meanings that have
been attached to probability and may be omitted by those who are interested 
only in mathematical (axiomatic) probability, which is defined in Sec. 3 and 
used throughout the remainder of the text. Section 3 is subdivided into six 
subsections. The first, Subsec. 3.1, discusses the concept of probability models. 
It provides a real-world setting for the eventual mathematical definition of 
probability. A review of some of the set theoretical concepts that are relevant 
to probability is given in Subsec. 3.2. Sample space and event space are 
defined in Subsec. 3.3. Subsection 3.4 commences with a recall of the definition 
of a function. Such a definition is useful since many of the words to be defined 
in this and coming chapters (e.g., probability, random variable, distribution, 
etc.) are defined as particular functions. The indicator function, to be used 
extensively in later chapters, is defined here. The probability axioms are presented,
and the probability function is defined. Several properties of this probability
function are stated. The culmination of this subsection is the definition
of a probability space. Subsection 3.5 is devoted to examples of probabilities
defined on finite sample spaces. The related concepts of independence of
events and conditional probability are discussed in the sixth and final subsection. 
Bayes’ theorem, the multiplication rule, and the theorem of total probabilities 
are proved or derived, and examples of each are given. 

Of the three main sections included in this chapter, only Sec. 3, which is 
by far the longest, is vital. The definitions of probability, probability space, 
conditional probability, and independence, along with familiarity with the 
properties of probability, conditional and unconditional and related formulas, 
are the essence of this chapter. This chapter is a background chapter; it introduces
the language of probability to be used in developing distribution theory,
which is the backbone of the theory of statistics. 


2 KINDS OF PROBABILITY 


2.1 Introduction 

One of the fundamental tools of statistics is probability, which had its formal 
beginnings with games of chance in the seventeenth century. 

Games of chance, as the name implies, include such actions as spinning a 
roulette wheel, throwing dice, tossing a coin, drawing a card, etc., in which the 
outcome of a trial is uncertain. However, it is recognized that even though the 
outcome of any particular trial may be uncertain, there is a predictable long-term
outcome. It is known, for example, that in many throws of an ideal
(balanced, symmetrical) coin about one-half of the trials will result in heads. 
It is this long-term, predictable regularity that enables gaming houses to engage 
in the business. 

A similar type of uncertainty and long-term regularity often occurs in 
experimental science. For example, in the science of genetics it is uncertain 
whether an offspring will be male or female, but in the long run it is known 
approximately what percent of offspring will be male and what percent will be 
female. A life insurance company cannot predict which persons in the United 
States will die at age 50, but it can predict quite satisfactorily how many people 
in the United States will die at that age.

First we shall discuss the classical, or a priori, theory of probability; then 
we shall discuss the frequency theory. Development of the axiomatic approach 
will be deferred until Sec. 3. 
2.2 Classical or A Priori Probability 

As we stated in the previous subsection, the theory of probability in its early 
stages was closely associated with games of chance. This association prompted 
the classical definition. For example, suppose that we want the probability of 
the event that an ideal coin will turn up heads. We argue in this manner: Since
there are only two ways that the coin can fall, heads or tails, and since the coin
is well balanced, one would expect that the coin is just as likely to fall heads as
tails; hence, the probability of the event of a head will be given the value 1/2.
This kind of reasoning prompted the following classical definition of probability.

Definition 1 Classical probability If a random experiment can result
in n mutually exclusive and equally likely outcomes and if $n_A$ of these
outcomes have an attribute A, then the probability of A is the fraction
$n_A/n$. ////

We shall apply this definition to a few examples in order to illustrate its meaning. 

If an ordinary die (one of a pair of dice) is tossed, there are six possible
outcomes: any one of the six numbered faces may turn up. These six outcomes
are mutually exclusive since two or more faces cannot turn up simultaneously.
And if the die is fair, or true, the six outcomes are equally likely; i.e., it is expected
that each face will appear with about equal relative frequency in the long run.
Now suppose that we want the probability that the result of a toss be an even
number. Three of the six possible outcomes have this attribute. The probability
that an even number will appear when a die is tossed is therefore 3/6, or 1/2.
Similarly, the probability that a 5 will appear when a die is tossed is 1/6. The
probability that the result of a toss will be greater than 2 is 2/3.

To consider another example, suppose that a card is drawn at random from
an ordinary deck of playing cards. The probability of drawing a spade is
readily seen to be 13/52, or 1/4. The probability of drawing a number between 5
and 10, inclusive, is 24/52, or 6/13.
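
The counting in these two examples can be reproduced by direct enumeration. The following is a minimal sketch, not part of the original text: the function name classical_probability and the encoding of a card as a (rank, suit) pair with ranks 1 through 13 are assumptions made here for illustration. It simply counts the outcomes having the attribute and divides by the total number of outcomes, as Definition 1 prescribes.

```python
from fractions import Fraction
from itertools import product

def classical_probability(outcomes, has_attribute):
    """Return n_A / n for equally likely, mutually exclusive outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if has_attribute(outcome))
    return Fraction(favorable, len(outcomes))

# One toss of a fair die: six equally likely faces.
die = range(1, 7)
p_even = classical_probability(die, lambda face: face % 2 == 0)
p_five = classical_probability(die, lambda face: face == 5)
p_gt_two = classical_probability(die, lambda face: face > 2)

# One card from a 52-card deck. Ranks are encoded 1 (ace) through 13 (king);
# this encoding is an assumption of the sketch, not something fixed by the text.
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(range(1, 14), suits))
p_spade = classical_probability(deck, lambda card: card[1] == "spades")
p_five_to_ten = classical_probability(deck, lambda card: 5 <= card[0] <= 10)

print(p_even, p_five, p_gt_two, p_spade, p_five_to_ten)
# prints: 1/2 1/6 2/3 1/4 6/13
```

Exact fractions are used rather than floating-point numbers so that the results can be compared directly with the hand computations.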

The application of the definition is straightforward enough in these simple 
cases, but it is not always so obvious. Careful attention must be paid to the 
qualifications “mutually exclusive,” “equally likely,” and “random.” Suppose
that one wishes to compute the probability of getting two heads if a coin is 
tossed twice. He might reason that there are three possible outcomes for the 
two tosses: two heads, two tails, or one head and one tail. One of these three 
outcomes has the desired attribute, i.e., two heads; therefore the probability is
1/3. This reasoning is faulty because the three given outcomes are not equally
likely. The third outcome, one head and one tail, can occur in two ways
since the head may appear on the first toss and the tail on the second or the
head may appear on the second toss and the tail on the first. Thus there are
four equally likely outcomes: HH, HT, TH, and TT. The first of these has
the desired attribute, while the others do not. The correct probability is therefore
1/4. The result would be the same if two ideal coins were tossed
simultaneously.
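
A short enumeration makes the flaw in the three-outcome argument visible. This is an illustrative sketch only; the variable names are chosen here and do not come from the text. It lists the ordered outcomes of two tosses and then groups them by number of heads, showing that the three groups do not contain equal numbers of equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Ordered outcomes of two tosses of an ideal coin: HH, HT, TH, TT.
two_tosses = ["".join(pair) for pair in product("HT", repeat=2)]
p_two_heads = Fraction(two_tosses.count("HH"), len(two_tosses))

# Group the ordered outcomes by number of heads. The faulty argument treats
# these three groups as equally likely, but the groups differ in size.
class_sizes = {}
for outcome in two_tosses:
    heads = outcome.count("H")
    class_sizes[heads] = class_sizes.get(heads, 0) + 1

print(two_tosses)   # prints: ['HH', 'HT', 'TH', 'TT']
print(p_two_heads)  # prints: 1/4
print(class_sizes)  # prints: {2: 1, 1: 2, 0: 1}
```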

Again, suppose that one wished to compute the probability that a card
drawn from an ordinary well-shuffled deck will be an ace or a spade. In
enumerating the favorable outcomes, one might count 4 aces and 13 spades and
reason that there are 17 outcomes with the desired attribute. This is clearly
incorrect because these 17 outcomes are not mutually exclusive since the ace of
spades is both an ace and a spade. There are 16 outcomes that are favorable to
an ace or a spade, and so the correct probability is 16/52, or 4/13.
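
The double counting can be checked by treating the aces and the spades as sets of cards. The sketch below is an illustration under assumed names (ranks, suits, aces, spades are labels chosen here); it compares the naive count with the size of the union.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))

aces = {card for card in deck if card[0] == "A"}
spades = {card for card in deck if card[1] == "spades"}

naive_count = len(aces) + len(spades)   # counts the ace of spades twice
overlap = aces & spades                 # cards that are both an ace and a spade
correct_count = len(aces | spades)      # each favorable card counted once

p_ace_or_spade = Fraction(correct_count, len(deck))
print(naive_count, correct_count, p_ace_or_spade)
# prints: 17 16 4/13
```

The difference between the two counts is exactly the size of the overlap, which here is the single card that is both an ace and a spade.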

We note that by the classical definition the probability of event A is a 
number between 0 and 1 inclusive. The ratio $n_A/n$ must be less than or equal to
1 since the total number of possible outcomes cannot be smaller than the 
number of outcomes with a specified attribute. If an event is certain to happen, 
its probability is 1; if it is certain not to happen, its probability is 0. Thus, the 
probability of obtaining an 8 in tossing a die is 0. The probability that the 
number showing when a die is tossed is less than 10 is equal to 1. 

The probabilities determined by the classical definition are called a priori 
probabilities. When one states that the probability of obtaining a head in 
tossing a coin is 1/2, he has arrived at this result purely by deductive reasoning.
The result does not require that any coin be tossed or even be at hand. We say
that if the coin is true, the probability of a head is 1/2, but this is little more than
saying the same thing in two different ways. Nothing is said about how one 
can determine whether or not a particular coin is true. 

The fact that we shall deal with ideal objects in developing a theory of 
probability will not trouble us because that is a common requirement of mathematical
systems. Geometry, for example, deals with conceptually perfect
circles, lines with zero width, and so forth, but it is a useful branch of knowledge,
which can be applied to diverse practical problems.

There are some rather troublesome limitations in the classical, or a priori, 
approach. It is obvious, for example, that the definition of probability must 
be modified somehow when the total number of possible outcomes is infinite. 
One might seek, for example, the probability that an integer drawn at random 
from the positive integers be even. The intuitive answer to this question is 1/2.
If one were pressed to justify this result on the basis of the definition, he might
reason as follows: Suppose that we limit ourselves to the first 20 integers; 10
of these are even so that the ratio of favorable outcomes to the total number is
10/20, or 1/2. Again, if the first 200 integers are considered, 100 of these are even,
and the ratio is also 1/2. In general, the first 2N integers contain N even integers;
if we form the ratio N/2N and let N become infinite so as to encompass the whole
set of positive integers, the ratio remains 1/2. The above argument is plausible,
and the answer is plausible, but it is no simple matter to make the argument 
stand up. It depends, for example, on the natural ordering of the positive 
integers, and a different ordering could produce a different result. Thus, one 
could just as well order the integers in this way: 1, 3, 2; 5, 7, 4; 9, 11, 6; ..., 
taking the first pair of odd integers then the first even integer, the second pair 
of odd integers then the second even integer, and so forth. With this ordering, 
one could argue that the probability of drawing an even integer is 1/3. The
integers can also be ordered so that the ratio will oscillate and never approach 
any definite value as N increases. 
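
The dependence on ordering can be seen numerically. The sketch below is only an illustration, with function names chosen here: it computes the proportion of even integers among the first terms of the natural ordering and of the rearranged ordering 1, 3, 2; 5, 7, 4; 9, 11, 6; ... described above. Both orderings list every positive integer exactly once, yet the proportions differ.

```python
from fractions import Fraction
from itertools import count, islice

def two_odds_then_one_even():
    """Yield 1, 3, 2, 5, 7, 4, 9, 11, 6, ...: two odd integers, then one even."""
    for k in count(0):
        yield 4 * k + 1
        yield 4 * k + 3
        yield 2 * k + 2

def even_fraction(ordering, how_many):
    """Proportion of even integers among the first how_many terms of ordering."""
    first_terms = list(islice(ordering, how_many))
    evens = sum(1 for n in first_terms if n % 2 == 0)
    return Fraction(evens, how_many)

natural_ratio = even_fraction(count(1), 2000)
rearranged_ratio = even_fraction(two_odds_then_one_even(), 3000)

print(natural_ratio, rearranged_ratio)
# prints: 1/2 1/3
```

A finite computation of this kind does not settle what the limiting ratio "should" be; it only shows that the answer obtained by this line of argument depends on the ordering adopted.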

There is another difficulty with the classical approach to the theory of 
probability which is deeper even than that arising in the case of an infinite 
number of outcomes. Suppose that we toss a coin known to be biased in 
favor of heads (it is bent so that a head is more likely to appear than a tail). 
The two possible outcomes of tossing the coin are not equally likely. What is 
the probability of a head ? The classical definition leaves us completely helpless 
here. 

Still another difficulty with the classical approach is encountered when we 
try to answer questions such as the following: What is the probability that a 
child born in Chicago will be a boy? Or what is the probability that a male 
will die before age 50? Or what is the probability that a cookie bought at a 
certain bakery will have less than three peanuts in it? All these are legitimate 
questions which we want to bring into the realm of probability theory. However, 
notions of “symmetry,” “equally likely,” etc., cannot be utilized as they could 
be in games of chance. Thus we shall have to alter or extend our definition to 
bring problems similar to the above into the framework of the theory. This 
more widely applicable probability is called a posteriori probability, or frequency, 
and will be discussed in the next subsection. 


2.3 A Posteriori or Frequency Probability 

A coin which seemed to be well balanced and symmetrical was tossed 100 times, 
and the outcomes recorded in Table 1. The important thing to notice is that the 
relative frequency of heads is close to 1/2. This is not unexpected since the coin
was symmetrical, and it was anticipated that in the long run heads would occur
about one-half of the time. For another example, a single die was thrown 300
times, and the outcomes recorded in Table 2. Notice how close the relative
frequency of a face with a 1 showing is to 1/6; similarly for a 2, 3, 4, 5, and 6.
These results are not unexpected since the die which was used was quite
symmetrical and balanced; it was expected that each face would occur with about
equal frequency in the long run. This suggests that we might be willing to use 
this relative frequency in Table 1 as an approximation for the probability that 
the particular coin used will come up heads or we might be willing to use the 
relative frequencies in Table 2 as approximations for the probabilities that 
various numbers on this die will appear. Note that although the relative
frequencies of the different outcomes are predictable, the actual outcome of an
individual throw is unpredictable.
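
The long-run stability of relative frequencies can be imitated on a computer. The following sketch is an illustration and not a record of the experiments in Tables 1 and 2: it uses Python's pseudorandom generator as a stand-in for an ideal coin and an ideal die, with an arbitrary fixed seed and arbitrary numbers of trials, all of which are assumptions made here.

```python
import random
from collections import Counter

def relative_frequencies(outcomes, trials, rng):
    """Simulate equally likely outcomes; return each outcome's relative frequency."""
    counts = Counter(rng.choice(outcomes) for _ in range(trials))
    return {outcome: counts[outcome] / trials for outcome in outcomes}

rng = random.Random(0)  # fixed seed (arbitrary choice) so that runs are repeatable
trial_sizes = (100, 10_000, 100_000)

coin_by_trials = {n: relative_frequencies("HT", n, rng) for n in trial_sizes}
die_by_trials = {n: relative_frequencies(range(1, 7), n, rng) for n in trial_sizes}

for n in trial_sizes:
    print(n, coin_by_trials[n], die_by_trials[n])
```

With more trials the simulated relative frequencies tend to lie closer to 1/2 for the coin and 1/6 for each face of the die, although any single simulated toss remains unpredictable.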

In fact, it seems reasonable to assume for the coin experiment that there 
exists a number, label it p, which is the probability of a head. Now if the coin 
appears well balanced, symmetrical, and true, we might use Definition 1 and 
state that p is approximately equal to 1/2. It is only an approximation to set p
equal to 1/2 since for this particular coin we cannot be certain that the two cases,
heads and tails, are exactly equally likely. But by examining the balance and 
symmetry of the coin it may seem quite reasonable to assume that they are. 
Alternatively, the coin could be tossed a large number of times, the results 
recorded as in Table 1, and the relative frequency of a head used as an
approximation for p. In the experiment with a die, the probability $p_2$ of a 2 showing
could be approximated by using Definition 1 or by using the relative frequency
in Table 2. The important thing is that we postulate that there is a number p
which is defined as the probability of a head with the coin or a number $p_2$
which is the probability of a 2 showing in the throw of the die. Whether we use 
Definition 1 or the relative frequency for the probability seems unimportant in 
the examples cited. 


Table 1 RESULTS OF TOSSING A COIN 100 TIMES

Outcome   Observed    Observed relative   Long-run expected relative
          frequency   frequency           frequency of a balanced coin
H         56          .56                 .50
T         44          .44                 .50
Total     100         1.00                1.00

Suppose, as described above, that the coin is unbalanced so that we are
quite certain from an examination that the two cases, heads and tails, are not
equally likely to happen. In these cases a number p can still be postulated
as the probability that a head shows, but the classical definition will not help us 
to find the value of p. We must use the frequency approach or possibly some 
physical analysis of the unbalanced coin. 

In many scientific investigations, observations are taken which have an element
of uncertainty or unpredictability in them. As a very simple example, suppose
that we want to predict whether the next baby born in a certain locality will
be a male or a female. This is individually an uncertain event, but the results of
groups of births can be dealt with satisfactorily. We find that a certain long-run
regularity exists which is similar to the long-run regularity of the frequency
ratio of a head when a coin is thrown. If, for example, we find upon examination 
of records that about 51 percent of the births are male, it might be reasonable to 
postulate that the probability of a male birth in this locality is equal to a number 
p and take .51 as its approximation. 

To make this idea more concrete, we shall assume that a series of observations
(or experiments) can be made under quite uniform conditions. That is,
an observation of a random experiment is made; then the experiment is repeated
under similar conditions, and another observation taken. This is repeated
many times, and while the conditions are similar each time, there is an
uncontrollable variation which is haphazard or random so that the observations are
individually unpredictable. In many of these cases the observations fall into
certain classes wherein the relative frequencies are quite stable. This suggests
that we postulate a number p, called the probability of the event, and approximate
p by the relative frequency with which the repeated observations satisfy the
event. For instance, suppose that the experiment consists of sampling the
population of a large city to see how many voters favor a certain proposal.
The outcomes are “favor” or “do not favor,” and each voter’s response is
unpredictable, but it is reasonable to postulate a number p as the probability that
a given response will be “favor.” The relative frequency of “favor” responses
can be used as an approximate value for p.


Table 2 RESULTS OF TOSSING A DIE 300 TIMES

Outcome   Observed    Observed relative   Long-run expected relative
          frequency   frequency           frequency of a balanced die
1         51          .170                .1667
2         54          .180                .1667
3         48          .160                .1667
4         51          .170                .1667
5         49          .163                .1667
6         47          .157                .1667
Total     300         1.000               1.000

As another example, suppose that the experiment consists of sampling 
transistors from a large collection of transistors. We shall postulate that the 
probability of a given transistor being defective is p. We can approximate p by 
selecting several transistors at random from the collection and computing the 
relative frequency of the number defective. 

The important thing is that we can conceive of a series of observations or 
experiments under rather uniform conditions. Then a number p can be postulated
as the probability of the event A happening, and p can be approximated by
the relative frequency of the event A in a series of experiments. 


3 PROBABILITY—AXIOMATIC 
3.1 Probability Models 

One of the aims of science is to predict and describe events in the world in which 
we live. One way in which this is done is to construct mathematical models 
which adequately describe the real world. For example, the equation $s = \frac{1}{2}gt^2$
expresses a certain relationship between the symbols s, g, and t. It is a
mathematical model. To use the equation $s = \frac{1}{2}gt^2$ to predict s, the distance a body
falls, as a function of time t, the gravitational constant g must be known. The
latter is a physical constant which must be measured by experimentation if the
equation $s = \frac{1}{2}gt^2$ is to be useful. The reason for mentioning this equation is
that we do a similar thing in probability theory; we construct a probability 
model which can be used to describe events in the real world. For example, it 
might be desirable to find an equation which could be used to predict the sex of 
each birth in a certain locality. Such an equation would be very complex, and 
none has been found. However, a probability model can be constructed which, 
while not very helpful in dealing with an individual birth, is quite useful in 
dealing with groups of births. Therefore, we can postulate a number p which 
represents the probability that a birth will be a male. From this fundamental 
probability we can answer questions such as: What is the probability that in 
ten births at least three will be males ? Or what is the probability that there will 
be three consecutive male births in the next five? To answer questions such as 
these and many similar ones, we shall develop an idealized probability model. 
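
As a preview of the kind of calculation such a model permits, the two questions above can be answered numerically once further assumptions are made. The sketch below is not the development given later in the text. It assumes that successive births are independent, that each is male with the same probability p, and that p = .51 (the figure used as an approximation in the birth-record example of Subsec. 2.3). The function names are chosen here for illustration.

```python
from itertools import product
from math import comb

def prob_at_least(k, n, p):
    """P(at least k males in n births), assuming independent births with P(male) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def prob_run_of_males(run, n, p):
    """P(some `run` consecutive males among n births), by enumerating all sequences."""
    total = 0.0
    for sequence in product("MF", repeat=n):
        if "M" * run in "".join(sequence):
            males = sequence.count("M")
            # Probability of this particular sequence under independence.
            total += p**males * (1 - p) ** (n - males)
    return total

p = 0.51  # assumed value, taken from the earlier birth-record illustration
at_least_three_of_ten = prob_at_least(3, 10, p)
three_consecutive_in_five = prob_run_of_males(3, 5, p)
print(at_least_three_of_ten, three_consecutive_in_five)
```

Enumerating all sequences is practical only for small n, but it keeps the second calculation transparent: every sequence of five births containing three consecutive males contributes its own probability to the total.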
The two general types of probability (a priori and a posteriori) defined 
above have one important thing in common: They both require a conceptual 
experiment in which the various outcomes can occur under somewhat uniform 
conditions. For example, repeated tossing of a coin for the a priori case, and 
repeated birth for the a posteriori case. However, we might like to bring into 
the realm of probability theory situations which cannot conceivably fit into the 
framework of repeated outcomes under somewhat similar conditions. For 
example, we might like to answer questions such as: What is the probability my 
wife loves me? Or what is the probability that World War III will start before 
January 1, 1985? These types of problems are certainly a legitimate part of 
general probability theory and are included in what is referred to as subjective 
probability. We shall not discuss subjective probability to any great extent in 
this book, but we remark that the axioms of probability from which we develop 
probability theory are rich enough to include a priori probability, a posteriori 
probability, and subjective probability. 

To start, we require that every possible outcome of the experiment under 
study can be enumerated. For example, in the coin-tossing experiment there are 
two possible outcomes: heads and tails. We shall associate probabilities only 
with these outcomes or with collections of these outcomes. We add, however, 
that even if a particular outcome is impossible, it can be included (its probability 
is 0). The main thing to remember is that every outcome which can occur 
must be included. 

Each conceivable outcome of the conceptual experiment under study will be 
defined as a sample point, and the totality of conceivable outcomes (or sample 
points) will be defined as the sample space. 

Our object, of course, is to assess the probability of certain outcomes or 
collections of outcomes of the experiment. Discussion of such probabilities 
is conveniently couched in the language of set theory, an outline of which 
appears in the next subsection. We shall return to formal definitions and 
examples of sample space, event, and probability. 


3.2 An Aside—Set Theory 

We begin with a collection of objects. Each object in our collection will be 
called a point or element. We assume that our collection of objects is large 
enough to include all the points under consideration in a given discussion. 
The totality of all these points is called the space, universe, or universal set. 
We will call it the space (anticipating that it will become the sample space when
we speak of probability) and denote it by $\Omega$. Let $\omega$ denote an element or point
in $\Omega$. Although a set can be defined as any collection of objects, we shall
assume, unless otherwise stated, that all the sets mentioned in a given discussion
consist of points in the space $\Omega$.


EXAMPLE 1 $\Omega = R^2$, where $R^2$ is the collection of points $\omega$ in the plane and
$\omega = (x, y)$ is any pair of real numbers x and y. ////

EXAMPLE 2 $\Omega$ = {all United States citizens}. ////

We shall usually use capital Latin letters from the beginning of the
alphabet, with or without subscripts, to denote sets. If $\omega$ is a point or element
belonging to the set A, we shall write $\omega \in A$; if $\omega$ is not an element of A, we
shall write $\omega \notin A$.

Definition 2 Subset If every element of a set A is also an element of a
set B, then A is defined to be a subset of B, and we shall write $A \subset B$ or
$B \supset A$; read “A is contained in B” or “B contains A.” ////

Definition 3 Equivalent sets Two sets A and B are defined to be equivalent,
or equal, if $A \subset B$ and $B \subset A$. This will be indicated by writing
$A = B$. ////

Definition 4 Empty set If a set A contains no points, it will be called
the null set, or empty set, and denoted by $\phi$. ////

Definition 5 Complement The complement of a set A with respect to
the space $\Omega$, denoted by $\overline{A}$, $A^c$, or $\Omega - A$, is the set of all points that are in
$\Omega$ but not in A. ////

Definition 6 Union Let A and B be any two subsets of $\Omega$; then the
set that consists of all points that are in A or B or both is defined to be
the union of A and B and written $A \cup B$. ////

Definition 7 Intersection Let A and B be any two subsets of $\Omega$; then
the set that consists of all points that are in both A and B is defined to be
the intersection of A and B and is written $A \cap B$ or $AB$. ////

Definition 8 Set difference Let A and B be any two subsets of $\Omega$. The
set of all points in A that are not in B will be denoted by $A - B$ and is
defined as set difference. ////
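
For a finite space, these definitions correspond directly to operations on Python's built-in set type. The sketch below is an illustration only; the particular space and the sets A and B are hypothetical choices made here, not examples from the text.

```python
# A small hypothetical space and two of its subsets.
omega = set(range(1, 11))      # the space: the integers 1 through 10
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

complement_A = omega - A       # Definition 5: points of the space not in A
union_AB = A | B               # Definition 6: points in A or B or both
intersection_AB = A & B        # Definition 7: points in both A and B
difference_AB = A - B          # Definition 8: points in A that are not in B
empty = set()                  # Definition 4: the set containing no points

# Definition 2: subset. Definition 3: equality as mutual containment.
intersection_in_A = intersection_AB <= A
mutual_containment = (union_AB <= (B | A)) and ((B | A) <= union_AB)

print(complement_A)     # prints: {5, 6, 7, 8, 9, 10}
print(union_AB)         # prints: {1, 2, 3, 4, 5, 6}
print(intersection_AB)  # prints: {3, 4}
print(difference_AB)    # prints: {1, 2}
```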
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EXAMPLE 3 Let £2 = {(x, y)\ 0 < * ^ 1 and 0 < y < 1}, which is read the 
collection of all points (x, y) for which 0 < x < 1 and 0 < y < 1. Define 
the following sets: 

A y = {(X, y): 0 < x < 1; 0 < y < i}, 

A 2 = {(x, y): 0 < x < 0 < y < 1}, 

A 3 = {(x,y):0<x<y< 1}, 

A 4 = {(x, y ): 0 < x < i; 0 < y < i>. 

(We shall adhere to the practice initiated here of using braces to embrace 
the points of a set.) 

The set relations below follow. 

c; Ay', A 4 cz A 2 ; Ay n A 2 = AyA 2 = A^', 

A 2 u A 3 = A 4 u A 3 ; Ay ={(x, j>): 0 < x < 1; \ < y < 1}; 

Ai ~A 4 = {(x,y):± <x< l;0<>-<4}. //// 

EXAMPLE 4 Let $\Omega$, $A_1$, $A_2$, and $A_3$ be as indicated in the diagrams in Fig. 1,
which are called Venn diagrams. ////

The set operations of complement, union, and intersection have been 
defined in Definitions 5 to 7, respectively. These set operations satisfy quite a 
number of laws, some of which follow, stated as theorems. Proofs are omitted. 

Theorem 1 Commutative laws $A \cup B = B \cup A$ and $A \cap B = B \cap A$. ////

Theorem 2 Associative laws $A \cup (B \cup C) = (A \cup B) \cup C$, and
$A \cap (B \cap C) = (A \cap B) \cap C$. ////

Theorem 3 Distributive laws $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$, and
$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$. ////

Theorem 4 $(A^c)^c = \overline{(\overline{A})} = A$; in words, the complement of $A$
complement equals $A$. ////

Theorem 5 $A\Omega = A$; $A \cup \Omega = \Omega$; $A\phi = \phi$; and $A \cup \phi = A$. ////

Theorem 6 $A\overline{A} = \phi$; $A \cup \overline{A} = \Omega$; $A \cap A = A$; and $A \cup A = A$. ////

Theorem 7 $\overline{(A \cup B)} = \overline{A} \cap \overline{B}$, and $\overline{(A \cap B)} = \overline{A} \cup \overline{B}$. These are known
as De Morgan’s laws. ////

Theorem 8 $A - B = A\overline{B}$. ////
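The laws in Theorems 1 to 8 can be spot-checked on a small finite space. The sketch below is an illustration only, not a proof; the space `OMEGA`, the sets `A`, `B`, `C`, and the helper `complement` are hypothetical names chosen for this example.

```python
# Spot-check of Theorems 3, 7, and 8 on one small example (not a proof).
OMEGA = frozenset(range(1, 11))      # hypothetical space {1, ..., 10}
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
C = frozenset({4, 6, 8, 10})


def complement(s, space=OMEGA):
    """Complement of s with respect to the space."""
    return space - s


# Theorem 3: distributive laws
distributive_1 = A & (B | C) == (A & B) | (A & C)
distributive_2 = A | (B & C) == (A | B) & (A | C)

# Theorem 7: De Morgan's laws
de_morgan_1 = complement(A | B) == complement(A) & complement(B)
de_morgan_2 = complement(A & B) == complement(A) | complement(B)

# Theorem 8: set difference written as an intersection with a complement
difference = (A - B) == (A & complement(B))
```

Each of the five flags evaluates to `True` for these sets; changing `A`, `B`, or `C` to other subsets of `OMEGA` should leave them `True`.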

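Several of the above laws are illustrated in the Venn diagrams in Fig. 1.
Although we will feel free to use any of the above laws, it might be instructive
to give a proof of one of them just to illustrate the technique. For example,
let us show that $\overline{(A \cup B)} = \overline{A} \cap \overline{B}$. By definition, two sets are equal if each is
contained in the other. We first show that $\overline{(A \cup B)} \subset \overline{A} \cap \overline{B}$ by proving that if
$\omega \in \overline{(A \cup B)}$, then $\omega \in \overline{A} \cap \overline{B}$. Now $\omega \in \overline{(A \cup B)}$ implies $\omega \notin A \cup B$, which
implies that $\omega \notin A$ and $\omega \notin B$, which in turn implies that $\omega \in \overline{A}$ and
$\omega \in \overline{B}$; that is, $\omega \in \overline{A} \cap \overline{B}$. We next show that $\overline{A} \cap \overline{B} \subset \overline{(A \cup B)}$. Let
$\omega \in \overline{A} \cap \overline{B}$, which means $\omega$ belongs to both $\overline{A}$ and $\overline{B}$. Then $\omega \notin A \cup B$, for
if it did, $\omega$ must belong to at least one of $A$ or $B$, contradicting that $\omega$
belongs to both $\overline{A}$ and $\overline{B}$; however, $\omega \notin A \cup B$ means $\omega \in \overline{(A \cup B)}$, completing
the proof.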

We defined union and intersection of two sets; these definitions extend
immediately to more than two sets, in fact to an arbitrary number of sets. It
is customary to distinguish between the sets in a collection of subsets of $\Omega$ by
assigning names to them in the form of subscripts. Let $\Lambda$ (Greek letter capital
lambda) denote a catalog of names, or indices. $\Lambda$ is also called an index set.
For example, if we are concerned with only two sets, then our index set $\Lambda$
includes only two indices, say 1 and 2; so $\Lambda = \{1, 2\}$.

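Definition 9 Union and intersection of sets Let $\Lambda$ be an index set and
$\{A_\lambda : \lambda \in \Lambda\} = \{A_\lambda\}$, a collection of subsets of $\Omega$ indexed by $\Lambda$. The set
of points that consists of all points that belong to $A_\lambda$ for at least one $\lambda$ is
called the union of the sets $\{A_\lambda\}$ and is denoted by $\bigcup_{\lambda \in \Lambda} A_\lambda$. The set of points
that consists of all points that belong to $A_\lambda$ for every $\lambda$ is called the
intersection of the sets $\{A_\lambda\}$ and is denoted by $\bigcap_{\lambda \in \Lambda} A_\lambda$. If $\Lambda$ is empty, then
define

$$\bigcup_{\lambda \in \Lambda} A_\lambda = \phi \quad\text{and}\quad \bigcap_{\lambda \in \Lambda} A_\lambda = \Omega. \quad ////$$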
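Definition 9, including its convention for an empty index set, can be sketched for finite collections as follows. The names `indexed_union`, `indexed_intersection`, `OMEGA`, and `family` are assumptions made for this illustration; the index set is represented by the keys of a dictionary.

```python
from functools import reduce


def indexed_union(family):
    """Union of A_lambda over all indices; an empty family gives the empty set."""
    return reduce(lambda acc, s: acc | s, family.values(), frozenset())


def indexed_intersection(family, space):
    """Intersection of A_lambda over all indices; an empty family gives the space.

    Starting from the whole space is harmless because every A_lambda is a
    subset of the space.
    """
    return reduce(lambda acc, s: acc & s, family.values(), frozenset(space))


OMEGA = frozenset(range(10))             # hypothetical space {0, ..., 9}
family = {                               # index set {1, 2, 3}
    1: frozenset({0, 1, 2, 3}),
    2: frozenset({2, 3, 4}),
    3: frozenset({3, 5}),
}

union_all = indexed_union(family)                        # points in at least one set
intersection_all = indexed_intersection(family, OMEGA)   # points in every set
empty_union = indexed_union({})                          # the empty set
empty_intersection = indexed_intersection({}, OMEGA)     # the whole space
```

Seeding the intersection with the space is what makes the empty-index convention fall out automatically rather than needing a special case.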

EXAMPLE 5 If $\Lambda = \{1, 2, \ldots, N\}$, i.e., $\Lambda$ is the index set consisting of the
first $N$ integers, then $\bigcup_{\lambda \in \Lambda} A_\lambda$ is also written as

$$\bigcup_{n=1}^{N} A_n = A_1 \cup A_2 \cup \cdots \cup A_N. \quad ////$$

One of the most fundamental theorems relating unions, intersections, and
complements for an arbitrary collection of sets is due to De Morgan.

Theorem 9 De Morgan’s theorem Let $\Lambda$ be an index set and $\{A_\lambda\}$ a
collection of subsets of $\Omega$ indexed by $\Lambda$. Then,

(i) $\overline{\bigcup_{\lambda \in \Lambda} A_\lambda} = \bigcap_{\lambda \in \Lambda} \overline{A}_\lambda$.

(ii) $\overline{\bigcap_{\lambda \in \Lambda} A_\lambda} = \bigcup_{\lambda \in \Lambda} \overline{A}_\lambda$. ////

We will not give a proof of this theorem. Note, however, that the special 
case when the index set A consists of only two names or indices is Theorem 7 
above, and a proof of part of Theorem 7 was given in the paragraph after 
Theorem 8. 

Definition 10 Disjoint or mutually exclusive Subsets $A$ and $B$ of $\Omega$ are
defined to be mutually exclusive or disjoint if $A \cap B = \phi$. Subsets
$A_1, A_2, \ldots$ are defined to be mutually exclusive if $A_i A_j = \phi$ for every $i \ne j$.
////

Theorem 10 If $A$ and $B$ are subsets of $\Omega$, then (i) $A = AB \cup A\overline{B}$, and
(ii) $AB \cap A\overline{B} = \phi$.

proof (i) $A = A \cap \Omega = A \cap (B \cup \overline{B}) = AB \cup A\overline{B}$. (ii) $AB \cap A\overline{B}
= AAB\overline{B} = A\phi = \phi$. ////

Theorem 11 If $A \subset B$, then $AB = A$, and $A \cup B = B$.

proof Left as an exercise. ////


3.3 Definitions of Sample Space and Event 

In Subsec. 3.1 we described what might be meant by a probability model. 
There we said that we had in mind some conceptual experiment whose possible 
outcomes we would like to study by assessing the probability of certain outcomes 
or collection of outcomes. In this subsection, we will give two important 
definitions, along with some examples, that will be used in assessing these 
probabilities. 

Definition 11 Sample space The sample space, denoted by $\Omega$, is the
collection or totality of all possible outcomes of a conceptual experiment.
////

One might try to understand the definition by looking at the individual
words. Use of the word “space” can be justified since the sample space is the
total collection of objects or elements which are the outcomes of the experiment.
This is in keeping with our use of the word “space” in set theory as the
collection of all objects of interest in a given discussion. The word “sample” is
harder to justify; our experiment is random, meaning that its outcome is
uncertain so that a given outcome is just one sample of many possible outcomes.

Some other symbols that are used in other texts to denote a sample space,
in addition to $\Omega$, are $S$, $Z$, $R$, $E$, $X$, and $A$.

Definition 12 Event and event space An event is a subset of the sample 
space. The class of all events associated with a given experiment is 
defined to be the event space. //// 

The above does not precisely define what an event is. An event will 
always be a subset of the sample space, but for sufficiently large sample spaces 
not all subsets will be events. Thus the class of all subsets of the sample space 
will not necessarily correspond to the event space. However, we shall see that 
the class of all events can always be selected to be large enough so as to include 
all those subsets (events) whose probability we may want to talk about. If the 
sample space consists of only a finite number of points, then the corresponding 
event space will be the class of all subsets of the sample space. 

Our primary interest will not be in events per se but will be in the
probability that an event does or does not occur or happen. An event $A$ is said to
occur if the experiment at hand results in an outcome (a point in our sample
space) that belongs to $A$. Since a point, say $\omega$, in the sample space is a subset
(that subset consisting of the point $\omega$) of the sample space $\Omega$, it is a candidate to
be an event. Thus $\omega$ can be viewed as a point in $\Omega$ or as a subset of $\Omega$. To
distinguish, let us write $\{\omega\}$, rather than just $\omega$, whenever $\omega$ is to be viewed as a
subset of $\Omega$. Such a one-point subset will always be an event and will be called
an elementary event. Also $\phi$ and $\Omega$ are both subsets of $\Omega$, and both will always
be events. $\Omega$ is sometimes called the sure event.

We shall attempt to use only capital Latin letters (usually from the
beginning of the alphabet), with or without affixes, to denote events, with the
exception that $\phi$ will be used to denote the empty set and $\Omega$ the sure event. The event
space will always be denoted by a script Latin letter, and usually $\mathcal{A}$. $\mathcal{B}$ and $\mathcal{F}$,
as well as other symbols, are used in some texts to denote the class of all events.

The sample space is basic and generally easy to define for a given
experiment. Yet, as we shall see, it is the event space that is really essential in
defining probability. Some examples follow.

EXAMPLE 6 The experiment is the tossing of a single die (a regular six-sided
polyhedron or cube marked on each face with one to six spots) and noting
which face is up. Now the die can land with any one of the six faces up;
so there are six possible outcomes of the experiment;

$\Omega$ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.

Let $A$ = {even number of spots up}. $A$ is an event; it is a subset of $\Omega$.
$A$ = {⚁, ⚃, ⚅}. Let $A_i$ = {$i$ spots up}; $i = 1, 2, \ldots, 6$. Each $A_i$ is an
elementary event. For this experiment the sample space is finite; hence
the event space is all subsets of $\Omega$. There are $2^6 = 64$ events, of which only
6 are elementary, in $\mathcal{A}$ (including both $\phi$ and $\Omega$). See Example 19 of
Subsec. 3.5, where a technique for counting the number of events in a finite
sample space is presented. ////
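The count of 64 events in Example 6 can be confirmed by listing every subset of the six-point sample space. This is a minimal sketch; the faces are encoded as the integers 1 to 6, and the names `OMEGA`, `all_subsets`, `events`, `elementary`, and `even` are assumptions of this illustration.

```python
from itertools import combinations

OMEGA = (1, 2, 3, 4, 5, 6)   # number of spots on the upturned face


def all_subsets(points):
    """Every subset of a finite collection of points, as frozensets."""
    return [
        frozenset(c)
        for r in range(len(points) + 1)
        for c in combinations(points, r)
    ]


events = all_subsets(OMEGA)                       # the event space
elementary = [e for e in events if len(e) == 1]   # one-point events
even = frozenset(w for w in OMEGA if w % 2 == 0)  # A = {even number of spots up}

n_events = len(events)            # 2**6
n_elementary = len(elementary)    # one per face
```

The empty set and the whole space both appear in `events`, as the example states.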


EXAMPLE 7 Toss a penny, nickel, and dime simultaneously, and note which
side is up on each. There are eight possible outcomes of this experiment.
$\Omega$ = {(H, H, H), (H, H, T), (H, T, H), (T, H, H), (H, T, T), (T, H, T),
(T, T, H), (T, T, T)}. We are using the first position of (·, ·, ·), called a
3-tuple, to record the outcome of the penny, the second position to record
the outcome of the nickel, and the third position to record the outcome of
the dime. Let $A_i$ = {exactly $i$ heads}; $i = 0, 1, 2, 3$. For each $i$, $A_i$ is an
event. Note that $A_0$ and $A_3$ are each elementary events. Again all
subsets of $\Omega$ are events; there are $2^8 = 256$ of them. ////


EXAMPLE 8 The experiment is to record the number of traffic deaths in the
state of Colorado next year. Any nonnegative integer is a conceivable
outcome of this experiment; so $\Omega = \{0, 1, 2, \ldots\}$. $A$ = {fewer than 500
deaths} = $\{0, 1, \ldots, 499\}$ is an event. $A_i$ = {exactly $i$ deaths}, $i = 0, 1,
\ldots$, is an elementary event. There is an infinite number of points in the
sample space, and each point is itself an (elementary) event; so there is an
infinite number of events. Each subset of $\Omega$ is an event. ////


EXAMPLE 9 Select a light bulb, and record the time in hours that it burns
before burning out. Any nonnegative number is a conceivable outcome
of this experiment; so $\Omega = \{x : x \ge 0\}$. For this sample space not all
subsets of $\Omega$ are events; however, any subset that can be exhibited will be
an event. For example, let

$A$ = {bulb burns for at least $k$ hours but burns out before $m$ hours}

$= \{x : k \le x < m\}$;

then $A$ is an event for any $0 \le k < m$. ////


EXAMPLE 10 Consider a random experiment which consists of counting
the number of times that it rains and recording in inches the total rainfall
next July in Fort Collins, Colorado. The sample space could then be
represented by

$\Omega = \{(i, x) : i = 0, 1, 2, \ldots \text{ and } 0 \le x\}$,

where in the 2-tuple (·, ·) the first position indicates the number of times
that it rains and the second position indicates the total rainfall. For
example, $\omega = (7, 2.251)$ is a point in $\Omega$ corresponding to there being seven
different times that it rained with a total rainfall of 2.251 inches. $A =
\{(i, x) : i = 5, \ldots, 10 \text{ and } x \ge 3\}$ is an example of an event. ////


EXAMPLE 11 In an agricultural experiment, the yield of five varieties of
wheat is examined. The five varieties are all grown under rather uniform
conditions. The outcome is a collection of five numbers $(y_1, y_2, y_3, y_4, y_5)$,
where $y_i$ represents the yield of the $i$th variety in bushels per acre. Each
$y_i$ can conceivably be any real number greater than or equal to 0. In this
example let the event $A$ be defined by the conditions that $y_2$, $y_3$, $y_4$, and
$y_5$ are each 10 or more bushels per acre larger than $y_1$, the standard
variety. In our notation we write

$A = \{(y_1, y_2, y_3, y_4, y_5) : y_j \ge y_1 + 10,\ j = 2, 3, 4, 5;\ 0 \le y_1\}$. ////

Our definition of sample space is precise and satisfactory, whereas our
definitions of event and event space are not entirely satisfactory. We said that
if the sample space was “sufficiently large” (as in Examples 9 to 11 above), not all
subsets of the sample space would be events; however, we did not say exactly
which subsets would be events and which would not. Rather than developing
the necessary mathematics to precisely define which subsets of $\Omega$ constitute our
event space $\mathcal{A}$, let us state some properties of $\mathcal{A}$ that it seems reasonable to
require:

(i) $\Omega \in \mathcal{A}$.

(ii) If $A \in \mathcal{A}$, then $\overline{A} \in \mathcal{A}$.

(iii) If $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cup A_2 \in \mathcal{A}$.

We said earlier that we were interested in events mainly because we would
be interested in the probability that an event happens. Surely, then, we would
want $\mathcal{A}$ to include $\Omega$, the sure event. Also, if $A$ is an event, meaning we can
talk about the probability that $A$ occurs, then $\overline{A}$ should also be an event so that
we can talk about the probability that $A$ does not occur. Similarly, if $A_1$ and
$A_2$ are events, so should $A_1 \cup A_2$ be an event.

Any collection of events with properties (i) to (iii) is called a Boolean
algebra, or just algebra, of events. We might note that the collection of all
subsets of $\Omega$ necessarily satisfies the above properties. Several results follow
from the above assumed properties of $\mathcal{A}$.
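For a finite sample space, properties (i) to (iii) can be checked mechanically. The sketch below is one way to do so; the function `is_algebra` and the example collections `trivial`, `halves`, and `broken` are hypothetical names introduced here, not notation from the text.

```python
def is_algebra(collection, space):
    """Check properties (i)-(iii) for a finite collection of subsets of space."""
    events = set(collection)
    space = frozenset(space)
    # (i) the sure event belongs to the collection
    if space not in events:
        return False
    # (ii) closed under complementation
    if any(space - a not in events for a in events):
        return False
    # (iii) closed under pairwise union
    return all(a | b in events for a in events for b in events)


OMEGA = frozenset({1, 2, 3, 4})
EMPTY = frozenset()

trivial = {EMPTY, OMEGA}
halves = {EMPTY, frozenset({1, 2}), frozenset({3, 4}), OMEGA}
broken = {EMPTY, frozenset({1}), OMEGA}   # the complement of {1} is missing

trivial_ok = is_algebra(trivial, OMEGA)
halves_ok = is_algebra(halves, OMEGA)
broken_ok = is_algebra(broken, OMEGA)
```

Here `trivial` and `halves` pass all three checks, while `broken` fails property (ii).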

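Theorem 12 $\phi \in \mathcal{A}$.

proof By property (i) $\Omega \in \mathcal{A}$; by (ii) $\overline{\Omega} \in \mathcal{A}$; but $\overline{\Omega} = \phi$; so $\phi \in \mathcal{A}$.
////


Theorem 13 If $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cap A_2 \in \mathcal{A}$.

proof $\overline{A}_1$ and $\overline{A}_2 \in \mathcal{A}$; hence $\overline{A}_1 \cup \overline{A}_2$, and $\overline{(\overline{A}_1 \cup \overline{A}_2)} \in \mathcal{A}$, but
$\overline{(\overline{A}_1 \cup \overline{A}_2)} = \overline{\overline{A}}_1 \cap \overline{\overline{A}}_2 = A_1 \cap A_2$ by De Morgan’s law. ////

Theorem 14 If $A_1, A_2, \ldots, A_n \in \mathcal{A}$, then $\bigcup_{i=1}^{n} A_i$ and $\bigcap_{i=1}^{n} A_i \in \mathcal{A}$.

proof Follows by induction. ////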

We will always assume that our collection of events $\mathcal{A}$ is an algebra,
which partially justifies our use of $\mathcal{A}$ as our notation for it. In practice, one
might take that collection of events of interest in a given consideration and
enlarge the collection, if necessary, to include (i) the sure event, (ii) all
complements of events already included, and (iii) all finite unions and intersections of
events already included, and this will be an algebra $\mathcal{A}$. Thus far, we have not
explained why $\mathcal{A}$ cannot always be taken to be the collection of all subsets of $\Omega$.
Such explanation will be given when we define probability in the next subsection.


3.4 Definition of Probability

In this section we give the axiomatic definition of probability. Although this 
formal definition of probability will not in itself allow us to achieve our goal of 
assigning actual probabilities to events consisting of certain outcomes of random 
experiments, it is another in a series of definitions that will ultimately lead to 
that goal. Since probability, as well as forthcoming concepts, is defined as a 
particular function, we begin this subsection with a review of the notion of a 
function. 

The definition of a function The following terminology is frequently used
to describe a function: A function, say $f(\cdot)$, is a rule (law, formula, recipe) that
associates each point in one set of points with one and only one point in another
set of points. The first collection of points, say $A$, is called the domain, and the
second collection, say $B$, the counterdomain.

Definition 13 Function A function, say $f(\cdot)$, with domain $A$ and
counterdomain $B$, is a collection of ordered pairs, say $(a, b)$, satisfying (i) $a \in A$
and $b \in B$; (ii) each $a \in A$ occurs as the first element of some ordered pair
in the collection (each $b \in B$ is not necessarily the second element of some
ordered pair); and (iii) no two (distinct) ordered pairs in the collection
have the same first element. ////

If $(a, b) \in f(\cdot)$, we write $b = f(a)$ (read “$b$ equals $f$ of $a$”) and call $f(a)$
the value of $f(\cdot)$ at $a$. For any $a \in A$, $f(a)$ is an element of $B$; whereas $f(\cdot)$ is
a set of ordered pairs. The set of all values of $f(\cdot)$ is called the range of $f(\cdot)$;
i.e., the range of $f(\cdot) = \{b \in B : b = f(a) \text{ for some } a \in A\}$ and is always a subset
of the counterdomain $B$ but is not necessarily equal to it. $f(a)$ is also called the
image of $a$ under $f(\cdot)$, and $a$ is called the preimage of $f(a)$.
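Definition 13 can be made concrete for finite sets by storing a function literally as a set of ordered pairs. In the sketch below, `is_function`, `range_of`, and the example sets are names assumed for illustration only.

```python
def is_function(pairs, domain, counterdomain):
    """Check conditions (i)-(iii) of Definition 13 for a finite set of pairs."""
    pairs = set(pairs)
    firsts = [a for a, _ in pairs]
    # (i) every pair lies in domain x counterdomain
    in_sets = all(a in domain and b in counterdomain for a, b in pairs)
    # (ii) every point of the domain occurs as a first element
    covers_domain = set(firsts) == set(domain)
    # (iii) no two distinct pairs share a first element
    single_valued = len(firsts) == len(set(firsts))
    return in_sets and covers_domain and single_valued


def range_of(pairs):
    """The set of all values, a subset of the counterdomain."""
    return {b for _, b in pairs}


A = {-2, -1, 0, 1, 2}                  # domain
B = {0, 1, 2, 3, 4}                    # counterdomain
square = {(a, a * a) for a in A}       # f(a) = a**2 as a set of ordered pairs
not_a_function = square | {(1, 3)}     # 1 would have two images, violating (iii)

square_ok = is_function(square, A, B)
square_range = range_of(square)        # a proper subset of B
bad_ok = is_function(not_a_function, A, B)
```

The range of `square` omits 2 and 3, which shows that the range need not equal the counterdomain.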


EXAMPLE 12 Let $f_1(\cdot)$ and $f_2(\cdot)$ be the two functions, having the real line
for their domain and counterdomain, defined by

$f_1(\cdot) = \{(x, y) : y = x^3 + x + 1,\ -\infty < x < \infty\}$

and

$f_2(\cdot) = \{(x, y) : y = x^2,\ -\infty < x < \infty\}$.

The range of $f_1(\cdot)$ is the counterdomain, the whole real line, but the range
of $f_2(\cdot)$ is all nonnegative real numbers, not the same as the
counterdomain. ////

Of particular interest to us will be a class of functions that are called
indicator functions.


Definition 14 Indicator function Let $\Omega$ be any space with points $\omega$
and $A$ any subset of $\Omega$. The indicator function of $A$, denoted by $I_A(\cdot)$,
is the function with domain $\Omega$ and counterdomain equal to the set
consisting of the two real numbers 0 and 1 defined by

$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A. \end{cases}$$

$I_A(\cdot)$ clearly “indicates” the set $A$. ////


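Properties of Indicator Functions Let $\Omega$ be any space and $\mathcal{A}$ any collection
of subsets of $\Omega$:

(i) $I_A(\omega) = 1 - I_{\overline{A}}(\omega)$ for every $A \in \mathcal{A}$.

(ii) $I_{A_1 A_2 \cdots A_n}(\omega) = I_{A_1}(\omega) \cdot I_{A_2}(\omega) \cdots I_{A_n}(\omega)$ for $A_1, \ldots, A_n \in \mathcal{A}$.

(iii) $I_{A_1 \cup A_2 \cup \cdots \cup A_n}(\omega) = \max[I_{A_1}(\omega), I_{A_2}(\omega), \ldots, I_{A_n}(\omega)]$ for $A_1, \ldots,
A_n \in \mathcal{A}$.

(iv) $I_A^2(\omega) = I_A(\omega)$ for every $A \in \mathcal{A}$.


Proofs of the above properties are left as an exercise.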
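Although the proofs are left as an exercise, properties (i) to (iv) can be checked numerically on a small space. The sketch below does this for three sets; the helper `indicator` and the sets `A1`, `A2`, `A3` are assumptions chosen for the illustration, and a check on one example is of course not a proof.

```python
def indicator(a):
    """Return the indicator function of the set a."""
    return lambda w: 1 if w in a else 0


OMEGA = set(range(1, 13))                  # hypothetical space {1, ..., 12}
A1 = {w for w in OMEGA if w % 2 == 0}      # even points
A2 = {w for w in OMEGA if w % 3 == 0}      # multiples of 3
A3 = {w for w in OMEGA if w > 4}           # points above 4

I1, I2, I3 = indicator(A1), indicator(A2), indicator(A3)

# (i) complement
prop_i = all(I1(w) == 1 - indicator(OMEGA - A1)(w) for w in OMEGA)
# (ii) intersection corresponds to a product
prop_ii = all(
    indicator(A1 & A2 & A3)(w) == I1(w) * I2(w) * I3(w) for w in OMEGA
)
# (iii) union corresponds to a maximum
prop_iii = all(
    indicator(A1 | A2 | A3)(w) == max(I1(w), I2(w), I3(w)) for w in OMEGA
)
# (iv) squaring an indicator changes nothing
prop_iv = all(I1(w) ** 2 == I1(w) for w in OMEGA)
```

All four flags are `True` for these sets.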

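The indicator function will be used to “indicate” subsets of the real line;
e.g.,

$$I_{\{[0, 1)\}}(x) = I_{[0, 1)}(x) = \begin{cases} 1 & \text{if } 0 \le x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

and if $I^+$ denotes the set of positive integers,

$$I_{I^+}(x) = \begin{cases} 1 & \text{if } x \text{ is some positive integer} \\ 0 & \text{otherwise.} \end{cases}$$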

Frequent use of indicator functions will be made throughout the remainder 
of this book. Often the utility of the indicator function is just notational 
efficiency as the following example shows. 


EXAMPLE 13 Let the function $f(\cdot)$ be defined by

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 2 - x & \text{for } 1 < x \le 2 \\ 0 & \text{for } 2 < x. \end{cases}$$

By using the indicator function, $f(x)$ can be written as

$$f(x) = x I_{(0, 1]}(x) + (2 - x) I_{(1, 2]}(x),$$

or also by using the absolute value symbol as

$$f(x) = (1 - |1 - x|) I_{(0, 2]}(x). \quad ////$$
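The three expressions for $f(x)$ in Example 13 can be compared numerically. The following is a minimal sketch; the function names and the evaluation grid are assumptions of this illustration, and the grid uses multiples of 1/8 so that all floating-point arithmetic is exact.

```python
def indicator(lo, hi):
    """Indicator function of the half-open interval (lo, hi]."""
    return lambda x: 1 if lo < x <= hi else 0


def f_piecewise(x):
    """f written case by case."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x
    if x <= 2:
        return 2 - x
    return 0.0


def f_indicator(x):
    """f written with two indicator functions."""
    return x * indicator(0, 1)(x) + (2 - x) * indicator(1, 2)(x)


def f_abs(x):
    """f written with one indicator function and an absolute value."""
    return (1 - abs(1 - x)) * indicator(0, 2)(x)


# Points -1.0, -0.875, ..., 3.0, covering all four cases and the break points.
grid = [k / 8 for k in range(-8, 25)]
agree = all(f_piecewise(x) == f_indicator(x) == f_abs(x) for x in grid)
```

The point $x = 0$ is excluded from $(0, 1]$, but all three forms still agree there because the value is 0 either way.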


Another type of function that we will have occasion to discuss is the set
function, defined as any function which has as its domain a collection of sets
and as its counterdomain the real line including, possibly, infinity. Examples
of set functions follow.


EXAMPLE 14 Let $\Omega$ be the sample space corresponding to the experiment of
tossing two dice, and let $\mathcal{A}$ be the collection of all subsets of $\Omega$. For
any $A \in \mathcal{A}$ define $N(A)$ = number of outcomes, or points in $\Omega$, that are
in $A$. Then $N(\phi) = 0$, $N(\Omega) = 36$, and $N(A) = 6$ if $A$ is the event
containing those outcomes having a total of seven spots up. ////


The size-of-set function alluded to in the above example can be defined, in
general, for any set $A$ as the number of points in $A$, where $A$ is a member of an
arbitrary collection of sets $\mathcal{A}$.


EXAMPLE 15 Let $\Omega$ be the plane or two-dimensional euclidean space and
$\mathcal{A}$ any collection of subsets of $\Omega$ for which area is meaningful. Then
for any $A \in \mathcal{A}$ define $Q(A)$ = area of $A$. For example, if $A = \{(x, y) : 0
\le x \le 1,\ 0 \le y \le 1\}$, then $Q(A) = 1$; if $A = \{(x, y) : x^2 + y^2 \le r^2\}$, then
$Q(A) = \pi r^2$; and if $A = \{(0, 0), (1, 1)\}$, then $Q(A) = 0$. ////


The probability function to be defined will be a particular set function. 


Probability function Let $\Omega$ denote the sample space and $\mathcal{A}$ denote a
collection of events assumed to be an algebra of events (see Subsec. 3.3) that we shall
consider for some random experiment.

Definition 15 Probability function A probability function $P[\cdot]$ is a set
function with domain $\mathcal{A}$ (an algebra of events)* and counterdomain the
interval $[0, 1]$ which satisfies the following axioms:

(i) $P[A] \ge 0$ for every $A \in \mathcal{A}$.

(ii) $P[\Omega] = 1$.

(iii) If $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{A}$
(that is, $A_i \cap A_j = \phi$ for $i \ne j$; $i, j = 1, 2, \ldots$) and if $A_1 \cup A_2 \cup \cdots =
\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, then

$$P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i]. \quad ////$$

These axioms are certainly motivated by the definitions of classical and
frequency probability. This definition of probability is a mathematical
definition; it tells us which set functions can be called probability functions; it does not
tell us what value the probability function $P[\cdot]$ assigns to a given event $A$. We
will have to model our random experiment in some way in order to obtain values
for the probability of events.

P[A] is read “ the probability of event A” or “ the probability that event A 
occurs,” which means the probability that any outcome in A occurs. 

We have used brackets rather than parentheses in our notation for a 
probability function, and we shall continue to do the same throughout the 
remainder of this book. 


*In defining a probability function, many authors assume that the domain of the set
function is a sigma-algebra rather than just an algebra. For an algebra $\mathcal{A}$, we had the
property

if $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cup A_2 \in \mathcal{A}$.

A sigma-algebra differs from an algebra in that the above property is replaced by

if $A_1, A_2, \ldots, A_n, \ldots \in \mathcal{A}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{A}$.

It can be shown that a sigma-algebra is an algebra, but not necessarily conversely.
If the domain of the probability function is taken to be a sigma-algebra then axiom
(iii) can be simplified to

$$P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i].$$

A fundamental theorem of probability theory, called the extension theorem, states that
if a probability function is defined on an algebra (as we have done) then it can be
extended to a sigma-algebra. Since the probability function can be extended from an
algebra to a sigma-algebra, it is reasonable to begin by assuming that the probability
function is defined on a sigma-algebra.

EXAMPLE 16 Consider the experiment of tossing two coins, say a penny and
a nickel. Let $\Omega = \{(H, H), (H, T), (T, H), (T, T)\}$ where the first
component of (·, ·) represents the outcome for the penny. Let us model this
random experiment by assuming that the four points in $\Omega$ are equally
likely; that is, assume $P[\{(H, H)\}] = P[\{(H, T)\}] = P[\{(T, H)\}] =
P[\{(T, T)\}]$. The following question arises: Is the $P[\cdot]$ function that is
implicitly defined by the above really a probability function; that is, does
it satisfy the three axioms? It can be shown that it does, and so it is
a probability function. ////
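For the two-coin model of Example 16 the three axioms can be verified by brute force over all 16 events. The sketch below assumes that the equally likely model extends to every event by $P[A] = N(A)/4$; because the sample space is finite, axiom (iii) reduces to additivity over disjoint pairs of events, which extends to any finite number by induction. The names `OMEGA`, `P`, and `events` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: (penny, nickel) outcomes.
OMEGA = frozenset(product("HT", repeat=2))


def P(event):
    """Equally likely model: P[A] = N(A) / N(Omega) (an assumption of this sketch)."""
    return Fraction(len(event), len(OMEGA))


# The event space: all 2**4 = 16 subsets of OMEGA.
events = [
    frozenset(c)
    for r in range(len(OMEGA) + 1)
    for c in combinations(sorted(OMEGA), r)
]

axiom_i = all(P(a) >= 0 for a in events)
axiom_ii = P(OMEGA) == 1
# Additivity for every pair of mutually exclusive events.
axiom_iii = all(
    P(a | b) == P(a) + P(b)
    for a in events
    for b in events
    if not (a & b)
)
```

Exact fractions are used instead of floating-point numbers so that the additivity check is not affected by rounding.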


In our definitions of event and $\mathcal{A}$, a collection of events, we stated that $\mathcal{A}$
cannot always be taken to be the collection of all subsets of $\Omega$. The reason for
this is that for “sufficiently large” $\Omega$ the collection of all subsets of $\Omega$ is so large
that it is impossible to define a probability function consistent with the above
axioms.

We are able to deduce a number of properties of our function $P[\cdot]$ from
its definition and three axioms. We list these as theorems.

It is in the statements and proofs of these properties that we will see the
convenience provided by assuming $\mathcal{A}$ is an algebra of events. $\mathcal{A}$ is the domain
of $P[\cdot]$; hence only members of $\mathcal{A}$ can be placed in the dot position of the
notation $P[\cdot]$. Since $\mathcal{A}$ is an algebra, if we assume that $A$ and $B \in \mathcal{A}$, we know
that $\overline{A}$, $A \cup B$, $AB$, $A\overline{B}$, etc., are also members of $\mathcal{A}$, and so it makes sense to
talk about $P[\overline{A}]$, $P[A \cup B]$, $P[AB]$, $P[A\overline{B}]$, etc.

Properties of $P[\cdot]$ For each of the following theorems, assume that $\Omega$ and
$\mathcal{A}$ (an algebra of events) are given and $P[\cdot]$ is a probability function having
domain $\mathcal{A}$.


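Theorem 15 $P[\phi] = 0$.

proof Take $A_1 = \phi$, $A_2 = \phi$, $A_3 = \phi$, ...; then by axiom (iii)

$$P[\phi] = P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} P[\phi],$$

which can hold only if $P[\phi] = 0$. ////

Theorem 16 If $A_1, \ldots, A_n$ are mutually exclusive events in $\mathcal{A}$, then

$$P[A_1 \cup \cdots \cup A_n] = \sum_{i=1}^{n} P[A_i].$$

proof Let $A_{n+1} = \phi$, $A_{n+2} = \phi$, ...; then $\bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$,
and

$$P\left[\bigcup_{i=1}^{n} A_i\right] = P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{n} P[A_i]. \quad ////$$


Theorem 17 If $A$ is an event in $\mathcal{A}$, then

$$P[\overline{A}] = 1 - P[A].$$

proof $A \cup \overline{A} = \Omega$, and $A \cap \overline{A} = \phi$; so

$$P[\Omega] = P[A \cup \overline{A}] = P[A] + P[\overline{A}].$$

But $P[\Omega] = 1$ by axiom (ii); the result follows. ////

Theorem 18 If $A$ and $B \in \mathcal{A}$, then $P[A] = P[AB] + P[A\overline{B}]$, and $P[A - B]
= P[A\overline{B}] = P[A] - P[AB]$.

proof $A = AB \cup A\overline{B}$, and $AB \cap A\overline{B} = \phi$; so $P[A] = P[AB] +
P[A\overline{B}]$. ////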

Theorem 19 For every two events $A$ and $B \in \mathcal{A}$, $P[A \cup B] = P[A]
+ P[B] - P[AB]$. More generally, for events $A_1, A_2, \ldots, A_n \in \mathcal{A}$

$$P[A_1 \cup A_2 \cup \cdots \cup A_n] = \sum_{j=1}^{n} P[A_j] - \mathop{\sum\sum}_{i<j} P[A_i A_j]
+ \mathop{\sum\sum\sum}_{i<j<k} P[A_i A_j A_k] - \cdots + (-1)^{n+1} P[A_1 A_2 \cdots A_n].$$

proof $A \cup B = A \cup \overline{A}B$, and $A \cap \overline{A}B = \phi$; so

$$P[A \cup B] = P[A] + P[\overline{A}B] = P[A] + P[B] - P[AB].$$

The more general statement is proved by mathematical induction. (See
Problem 16.) ////
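The general formula of Theorem 19 can be checked on a concrete case. The sketch below uses the two-dice sample space with equally likely points as an assumed model; the events `A1`, `A2`, `A3` and the function `inclusion_exclusion` are hypothetical names for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

# Two dice, 36 equally likely points (an assumed model for this sketch).
OMEGA = frozenset(product(range(1, 7), repeat=2))


def P(event):
    """P[A] = N(A) / N(Omega)."""
    return Fraction(len(event), len(OMEGA))


def inclusion_exclusion(events):
    """Right-hand side of the general formula in Theorem 19."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)               # +, -, +, ... by intersection size
        for group in combinations(events, r):
            total += sign * P(frozenset.intersection(*group))
    return total


A1 = frozenset(w for w in OMEGA if w[0] == 6)      # first die shows six
A2 = frozenset(w for w in OMEGA if w[1] == 6)      # second die shows six
A3 = frozenset(w for w in OMEGA if sum(w) == 7)    # total of seven

direct = P(A1 | A2 | A3)                   # left-hand side, by counting the union
via_formula = inclusion_exclusion([A1, A2, A3])
```

Both computations give 15/36 = 5/12: each event has 6 points, each pairwise intersection has 1 point, and the triple intersection is empty, so the union has 18 - 3 + 0 = 15 points.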

Theorem 20 If $A$ and $B \in \mathcal{A}$ and $A \subset B$, then $P[A] \le P[B]$.

proof $B = BA \cup B\overline{A}$, and $BA = A$; so $B = A \cup B\overline{A}$, and $A \cap B\overline{A} =
\phi$; hence $P[B] = P[A] + P[B\overline{A}]$. The conclusion follows by noting that
$P[B\overline{A}] \ge 0$. ////

Theorem 21 Boole’s inequality If $A_1, A_2, \ldots, A_n \in \mathcal{A}$, then

$$P[A_1 \cup A_2 \cup \cdots \cup A_n] \le P[A_1] + P[A_2] + \cdots + P[A_n].$$

proof $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 A_2] \le P[A_1] + P[A_2]$.
The proof is completed using mathematical induction. ////

We conclude this subsection with one final definition.

Definition 16 Probability space A probability space is the triplet
$(\Omega, \mathcal{A}, P[\cdot])$, where $\Omega$ is a sample space, $\mathcal{A}$ is a collection (assumed to be
an algebra) of events (each a subset of $\Omega$), and $P[\cdot]$ is a probability
function with domain $\mathcal{A}$. ////

Probability space is a single term that gives us an expedient way to assume
the existence of all three components in its notation. The three components are
related; $\mathcal{A}$ is a collection of subsets of $\Omega$, and $P[\cdot]$ is a function that has $\mathcal{A}$ as its
domain. The probability space’s main use is in providing a convenient method
of stating background assumptions for future definitions, theorems, etc. It also
ties together the main definitions that we have covered so far, namely, definitions
of sample space, event space, and probability.


3.5 Finite Sample Spaces 

In previous subsections we formally defined sample space, event, and probability,
culminating in the definition of probability space. We remarked there that these
formal definitions did not in themselves enable us to compute the value of the
probability for an event $A$, which is our goal. We said that we had to
appropriately model the experiment. In this section we show how this can be done
for finite sample spaces, that is, sample spaces with only a finite number of
elements or points in them.

In certain kinds of problems, of which games of chance are notable
examples, the sample space contains a finite number of points, say $N = N(\Omega)$.
[Recall that $N(A)$ is the size of $A$, that is, the number of sample points in $A$.]
Some of these problems can be modeled by assuming that points in the sample
space are equally likely. Such problems are the subject to be discussed next.

Finite sample space with equally likely points For certain random
experiments there is a finite number of outcomes, say $N$, and it is often realistic
to assume that the probability of each outcome is $1/N$. The classical definition
of probability is generally adequate for these problems, but we shall show how
the axiomatic definition is applicable as well. Let $\omega_1, \omega_2, \ldots, \omega_N$ be the $N$
sample points in a finite space $\Omega$. Suppose that the set function $P[\cdot]$ with
domain the collection of all subsets of $\Omega$ satisfies the following conditions:

(i) $P[\{\omega_1\}] = P[\{\omega_2\}] = \cdots = P[\{\omega_N\}]$.

(ii) If $A$ is any subset of $\Omega$ which contains $N(A)$ sample points [has size
$N(A)$], then $P[A] = N(A)/N$.

Then it is readily checked that the set function $P[\cdot]$ satisfies the three axioms
and hence is a probability function.

Definition 17 Equally likely probability function The probability
function $P[\cdot]$ satisfying conditions (i) and (ii) above is defined to be an equally
likely probability function. ////

Given that a random experiment can be realistically modeled by assuming
equally likely sample points, the only problem left in determining the value of
the probability of event $A$ is to find $N(\Omega) = N$ and $N(A)$. Strictly speaking this
is just a problem of counting: count the number of points in $A$ and the number
of points in $\Omega$.


EXAMPLE 17 Consider the experiment of tossing two dice (or of tossing one
die twice). Let $\Omega = \{(i_1, i_2) : i_1 = 1, 2, \ldots, 6;\ i_2 = 1, 2, \ldots, 6\}$. Here $i_1$ =
number of spots up on the first die, and $i_2$ = number of spots up on the
second die. There are $6 \cdot 6 = 36$ sample points. It seems reasonable to attach
the probability of $\frac{1}{36}$ to each sample point. $\Omega$ can be displayed as a lattice
as in Fig. 2. Let $A_7$ = event that the total is 7; then $A_7$ = {(1, 6), (2, 5),
(3, 4), (4, 3), (5, 2), (6, 1)}; so $N(A_7) = 6$, and $P[A_7] = N(A_7)/N(\Omega) = \frac{6}{36} =
\frac{1}{6}$. Similarly $P[A_j]$ can be calculated for $A_j$ = total of $j$; $j = 2, \ldots, 12$. In
this example the number of points in any event $A$ can be easily counted,
and so $P[A]$ can be evaluated for any event $A$. ////
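The counting in Example 17 can be carried out for every total at once. This is a minimal sketch of the equally likely model; the names `OMEGA`, `prob`, and `totals` are assumptions of the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 sample points (i1, i2) for two dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """Equally likely probability function: P[A] = N(A) / N(Omega)."""
    return Fraction(len(event), len(OMEGA))


# P[A_j] for A_j = {total of j}, j = 2, ..., 12.
totals = {
    j: prob([w for w in OMEGA if sum(w) == j])
    for j in range(2, 13)
}
```

The events $A_2, \ldots, A_{12}$ are mutually exclusive and their union is the whole sample space, so the eleven probabilities in `totals` sum to 1.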

If $N(A)$ and $N(\Omega)$ are large for a given random experiment with a finite
number of equally likely outcomes, the counting itself can become a difficult
problem. Such counting can often be facilitated by use of certain combinatorial
formulas, some of which will be developed now.

Assume now that the experiment is of such a nature that each outcome
can be represented by an $n$-tuple. The above example is such an experiment;
each outcome was represented by a 2-tuple. As another example, if the
experiment is one of drawing a sample of size $n$, then $n$-tuples are particularly
useful in recording the results. The terminology that is often used to describe
a basic random experiment known generally as sampling is that of balls and urns.
It is assumed that we have an urn containing, say, $M$ balls, which are numbered
1 to $M$. The experiment is to select or draw balls from the urn one at a time
until $n$ balls have been drawn. We say we have drawn a sample of size $n$. The
drawing is done in such a way that at the time of a particular draw each of the
balls in the urn at that time has an equal chance of selection. We say that a
ball has been selected at random. Two basic ways of drawing a sample are
with replacement and without replacement, meaning just what the words say. A
sample is said to be drawn with replacement if after each draw the ball drawn
is itself returned to the urn, and the sample is said to be drawn without
replacement if the ball drawn is not returned to the urn. Of course, in sampling without
replacement the size of the sample $n$ must be less than or equal to $M$, the original
number of balls in the urn, whereas in sampling with replacement the size of
sample may be any positive integer. In reporting the results of drawing a sample
of size $n$, an $n$-tuple can be used; denote the $n$-tuple by $(z_1, \ldots, z_n)$, where $z_i$
represents the number of the ball drawn on the $i$th draw.

In general, we are interested in the size of an event that is composed of
points that are $n$-tuples satisfying certain conditions. The size of such a set can be
computed as follows: First determine the number of objects, say $N_1$, that may be
used as the first component. Next determine the number of objects, say $N_2$,
that may be used as the second component of an $n$-tuple given that the first
component is known. (We are assuming that $N_2$ does not depend on which
object has occurred as the first component.) And then determine the number of
objects, say $N_3$, that may be used as the third component given that the first
and second components are known. (Again we are assuming $N_3$ does not
depend on which objects have occurred as the first and second components.)
Continue in this manner until $N_n$ is determined. The size $N(A)$ of the set $A$ of
$n$-tuples then equals $N_1 \cdot N_2 \cdots N_n$.


EXAMPLE 18 The total number of different ordered samples of n balls that
can be obtained by drawing balls from an urn containing M distinguishable
balls (distinguished by numbers 1 to M) is $M^n$ if the sampling is done
with replacement and is $M(M-1)\cdots(M-n+1)$ if the sampling is
done without replacement. An ordered sample can be represented by
an n-tuple, say $(z_1, \ldots, z_n)$, where $z_j$ is the number of the ball obtained
on the jth draw, and the total number of different ordered samples is the same as
the total number of n-tuples. In sampling with replacement, there are M
choices of numbers for the first component, M choices of numbers for the
second component, and finally M choices for the nth component. Thus
there are $M^n$ such n-tuples. In sampling without replacement, there are
M choices of numbers for the first component, M − 1 choices for the
second, M − 2 choices for the third, and finally M − n + 1 choices for
the nth component. In total, then, there are $M(M-1)(M-2)\cdots(M-n+1)$
such n-tuples. $M(M-1)\cdots(M-n+1)$ is abbreviated
$(M)_n$ (see Appendix A). ////
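
The two counts in Example 18 can be checked by brute-force enumeration for a small urn. The following is only a sketch: the values M = 5 and n = 3 are assumptions chosen for illustration, and the variable names are not from the text.

```python
from itertools import permutations, product
from math import perm

M, n = 5, 3  # assumed illustrative urn size and sample size
balls = range(1, M + 1)

# Ordered samples with replacement: every n-tuple over the M ball numbers.
with_replacement = list(product(balls, repeat=n))

# Ordered samples without replacement: n-tuples with no repeated ball number.
without_replacement = list(permutations(balls, n))

count_with = len(with_replacement)        # compare with M**n
count_without = len(without_replacement)  # compare with (M)_n
falling_factorial = perm(M, n)            # (M)_n = M(M-1)...(M-n+1)

print(count_with, M**n)
print(count_without, falling_factorial)
```

Each printed pair should agree, since the enumeration builds the n-tuples one component at a time exactly as in the counting argument.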


EXAMPLE 19 Let S be any set containing M elements. How many subsets
does S have? First let us determine the number of subsets of size n that
S has. Let $x_n$ denote this number, that is, the number of subsets of S of
size n. A subset of size n is a collection of n objects, the objects not
arranged in any particular order. For example, the subset $\{s_1, s_5, s_7\}$ is
the same as the subset $\{s_5, s_1, s_7\}$ since they contain the same three objects.
If we take a given subset of S which contains n elements, n! different
ordered samples can be obtained by sampling from the given subset
without replacement. If for each of the $x_n$ different subsets there are n!
different ordered samples of size n, then there are $(n!)x_n$ different ordered
samples of size n in sampling without replacement from the set S of M
elements. But we know from the previous example that this number is
$(M)_n$; hence $(n!)x_n = (M)_n$, or

$$x_n = \frac{(M)_n}{n!} = \binom{M}{n} = \text{number of subsets of size } n \text{ that may be formed from the elements of a set of size } M. \tag{1}$$

The total number of subsets of S, where S is a set of size M, is
$\sum_{n=0}^{M} \binom{M}{n}$.
This includes the empty set (set with no elements in it) and the whole set,
both of which are subsets. Using the binomial theorem (see Appendix A)

$$(a+b)^M = \sum_{n=0}^{M} \binom{M}{n} a^n b^{M-n},$$

with a = b = 1, we see that

$$2^M = \sum_{n=0}^{M} \binom{M}{n}; \tag{2}$$

thus a set of size M has $2^M$ subsets.

////
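
As a quick check of Eqs. (1) and (2), one can list the subsets of a small set and count them by size. This is a sketch under assumptions: the five-element set `S` below is hypothetical and stands in for any set of size M.

```python
from itertools import combinations
from math import comb

S = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical set with M = 5 elements
M = len(S)

# x[n] = number of subsets of size n, obtained by listing them explicitly.
x = [len(list(combinations(S, n))) for n in range(M + 1)]

# Eq. (1): each x[n] should equal the binomial coefficient C(M, n).
matches_binomial = all(x[n] == comb(M, n) for n in range(M + 1))

# Eq. (2): the counts, including the empty set and S itself, sum to 2**M.
total_subsets = sum(x)

print(x, matches_binomial, total_subsets, 2**M)
```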


EXAMPLE 20 Suppose an urn contains M balls numbered 1 to M, where the
first K balls are defective and the remaining M − K are nondefective.
The experiment is to draw n balls from the urn. Define $A_k$ to be the event
that the sample of n balls contains exactly k defectives. There are two
ways to draw the sample: (i) with replacement and (ii) without replacement.
We are interested in $P[A_k]$ under each method of sampling. Let
our sample space $\Omega = \{(z_1, \ldots, z_n) : z_j = \text{number of the ball drawn on the } j\text{th draw}\}$. Now

$$P[A_k] = \frac{N(A_k)}{N(\Omega)}.$$

From Example 18 above, we know $N(\Omega) = M^n$ under (i) and $N(\Omega) = (M)_n$
under (ii). $A_k$ is that subset of $\Omega$ for which exactly k of the $z_j$'s are ball
numbers 1 to K inclusive. These k ball numbers must fall in some subset
of k positions from the total number of n available positions. There are
$\binom{n}{k}$ ways of selecting the k positions for the ball numbers 1 to K inclusive to
fall in. For each of the $\binom{n}{k}$ different positions, there are $K^k (M-K)^{n-k}$
different n-tuples for case (i) and $(K)_k (M-K)_{n-k}$ different n-tuples for
case (ii). Thus $A_k$ has size $\binom{n}{k} K^k (M-K)^{n-k}$ for case (i) and size
$\binom{n}{k} (K)_k (M-K)_{n-k}$ for case (ii); so

$$P[A_k] = \frac{\binom{n}{k} K^k (M-K)^{n-k}}{M^n} \tag{3}$$



in sampling with replacement, and

$$P[A_k] = \frac{\binom{n}{k} (K)_k (M-K)_{n-k}}{(M)_n} \tag{4}$$

in sampling without replacement. This latter formula can be rewritten
as

$$P[A_k] = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}. \tag{5}$$


It might be instructive to derive Eq. (5) in another way. Suppose
that our sample space, denoted now by $\Omega'$, is made up of subsets of size
n, rather than n-tuples; that is, $\Omega' = \{\{z_1, \ldots, z_n\} : z_1, \ldots, z_n$ are the numbers
on the n balls drawn$\}$. There are $\binom{M}{n}$ subsets of size n of the M balls;
so $N(\Omega') = \binom{M}{n}$. If it is assumed that each of these subsets of size n is
just as likely as any other subset of size n (one can think of selecting all n
balls at once rather than one at a time), then $P[A'_k] = N(A'_k)/N(\Omega')$. Now
$N(A'_k)$ is the size of the event consisting of those subsets of size n which
contain exactly k balls from the balls that are numbered 1 to K inclusive.
The k balls from the balls that are numbered 1 to K can be selected in
$\binom{K}{k}$ ways, and the remaining n − k balls from the balls that are numbered
K + 1 to M can be selected in $\binom{M-K}{n-k}$ ways; hence
$N(A'_k) = \binom{K}{k}\binom{M-K}{n-k}$, and finally

$$P[A'_k] = \frac{N(A'_k)}{N(\Omega')} = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}.$$

We have derived the probability of exactly k defectives in sampling
without replacement by considering two different sample spaces; one
sample space consisted of n-tuples, the other consisted of subsets of size
n.

To aid in remembering the formula given in Eq. (5), note that
K + M − K = M and k + n − k = n; i.e., the sum of the “upper” terms
in the numerator equals the “upper” term in the denominator, and the
sum of the “lower” terms in the numerator equals the “lower” term
in the denominator. ////
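
Equations (3) and (5) can be compared against a direct count over the sample space of n-tuples. The sketch below assumes a small urn (M = 6, K = 2, n = 3, chosen only for illustration); the helper names `eq3`, `eq5`, and `prob_by_enumeration` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import comb

M, K, n = 6, 2, 3  # assumed small values; balls 1..K are the defectives
balls = range(1, M + 1)

def eq3(k):
    """Eq. (3): exactly k defectives, sampling with replacement."""
    return Fraction(comb(n, k) * K**k * (M - K)**(n - k), M**n)

def eq5(k):
    """Eq. (5): exactly k defectives, sampling without replacement."""
    return Fraction(comb(K, k) * comb(M - K, n - k), comb(M, n))

def prob_by_enumeration(samples, k):
    """N(A_k)/N(Omega), counting equally likely n-tuples directly."""
    samples = list(samples)
    hits = sum(1 for s in samples if sum(z <= K for z in s) == k)
    return Fraction(hits, len(samples))

for k in range(n + 1):
    by_count_i = prob_by_enumeration(product(balls, repeat=n), k)
    by_count_ii = prob_by_enumeration(permutations(balls, n), k)
    print(k, eq3(k) == by_count_i, eq5(k) == by_count_ii)
```

Exact fractions are used so that agreement between the formulas and the counts is not blurred by rounding.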


EXAMPLE 21 The formula given in Eq. (5) is particularly useful to calculate
certain probabilities having to do with card games. For example, we
might ask the probability that a certain 13-card hand contains exactly
6 spades. There are M = 52 total cards, and one can model the card
shuffling and dealing process by assuming that the 13-card hand represents
a sample of size 13 drawn without replacement from the 52 cards. Let
$A_6$ denote the event of exactly 6 spades. There are a total of 13 spades
(defective balls in sampling terminology); so

$$P[A_6] = \frac{\binom{13}{6}\binom{39}{7}}{\binom{52}{13}}$$

by Eq. (5).
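
For a numerical value, Eq. (5) can be evaluated directly with M = 52, K = 13, n = 13, and k = 6. The function name `hypergeometric` below is an assumption introduced for this sketch; it is not notation from the text.

```python
from math import comb

def hypergeometric(M, K, n, k):
    """Eq. (5): P[exactly k defectives] when n balls are drawn without
    replacement from M balls of which K are defective."""
    return comb(K, k) * comb(M - K, n - k) / comb(M, n)

# Exactly 6 spades in a 13-card hand dealt from a 52-card deck.
p_six_spades = hypergeometric(M=52, K=13, n=13, k=6)
print(round(p_six_spades, 4))
```

The result is roughly 0.04, so a hand with exactly six spades turns up about once in every 24 deals under this model.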



Many other formulas for probabilities of specified events defined on finite 
sample spaces with equally likely sample points can be derived using methods of 
combinatorial analysis, but we will not undertake such derivations here. The 
interested reader is referred to Refs. 10 and 8. 


Finite sample space without equally likely points We saw for finite sample
spaces with equally likely sample points that $P[A] = N(A)/N(\Omega)$ for any event A.
For finite sample spaces without equally likely sample points, things are not
quite as simple, but we can completely define the values of P[A] for each of the
$2^{N(\Omega)}$ events A by specifying the value of $P[\cdot]$ for each of the $N = N(\Omega)$
elementary events. Let $\Omega = \{\omega_1, \ldots, \omega_N\}$, and assume $p_j = P[\{\omega_j\}]$ for j = 1, ..., N.
Since

$$1 = P[\Omega] = P\Big[\bigcup_{j=1}^{N} \{\omega_j\}\Big] = \sum_{j=1}^{N} P[\{\omega_j\}],$$

$$\sum_{j=1}^{N} p_j = 1.$$

For any event A, define $P[A] = \sum p_j$, where the summation is over those $\omega_j$
belonging to A. It can be shown that $P[\cdot]$ so defined satisfies the three axioms
and hence is a probability function.

EXAMPLE 22 Consider an experiment that has N outcomes, say $\omega_1, \omega_2, \ldots,
\omega_N$, where it is known that outcome $\omega_{j+1}$ is twice as likely as outcome
$\omega_j$, where j = 1, ..., N − 1; that is, $p_{j+1} = 2p_j$, where $p_j = P[\{\omega_j\}]$. Find
$P[A_k]$, where $A_k = \{\omega_1, \omega_2, \ldots, \omega_k\}$. Since

$$\sum_{j=1}^{N} p_j = \sum_{j=1}^{N} 2^{j-1} p_1 = p_1(1 + 2 + 2^2 + \cdots + 2^{N-1}) = p_1(2^N - 1) = 1,$$

$$p_1 = \frac{1}{2^N - 1}$$

and

$$p_j = 2^{j-1}/(2^N - 1);$$

hence

$$P[A_k] = \sum_{j=1}^{k} p_j = \sum_{j=1}^{k} 2^{j-1}/(2^N - 1) = \frac{2^k - 1}{2^N - 1}. \qquad ////$$
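
The closed form in Example 22 can be confirmed by building the elementary probabilities and summing them, which is exactly how P[A] is defined for a finite sample space without equally likely points. This is a sketch; N = 6 and k = 3 are assumed values, and the function names are hypothetical.

```python
from fractions import Fraction

def outcome_probs(N):
    """p_1, ..., p_N when each outcome is twice as likely as the one before."""
    weights = [2 ** (j - 1) for j in range(1, N + 1)]
    total = sum(weights)  # geometric sum, equal to 2**N - 1
    return [Fraction(w, total) for w in weights]

def prob_A_k(N, k):
    """P[A_k] for A_k = {omega_1, ..., omega_k}: sum the p_j over A_k."""
    return sum(outcome_probs(N)[:k], Fraction(0))

N, k = 6, 3  # assumed illustrative values
closed_form = Fraction(2**k - 1, 2**N - 1)
print(prob_A_k(N, k), closed_form)
```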

3.6 Conditional Probability and Independence

In the application of probability theory to practical problems it is not infrequent 
that the experimenter is confronted with the following situation: Such and such 
has happened; now what is the probability that something else will happen? 
For example, in an experiment of recording the life of a light bulb, one might 
be interested in the probability that the bulb will last 100 hours given that it 
has already lasted for 24 hours. Or in an experiment of sampling from a box 
containing 100 resistors of which 5 are defective, what is the probability that 
the third draw results in a defective given that the first two draws resulted in 
defectives? Probability questions of this sort are considered in the framework 
of conditional probability, the subject that we study next. 

Conditional probability We begin by assuming that we have a probability
space, say $(\Omega, \mathscr{A}, P[\cdot])$; that is, we have at hand some random experiment for
which a sample space $\Omega$, collection of events $\mathscr{A}$, and probability function
$P[\cdot]$ have all been defined.

Given two events A and B, we want to define the conditional probability 
of event A given that event B has occurred. 

Definition 18 Conditional probability Let A and B be two events in $\mathscr{A}$
of the given probability space $(\Omega, \mathscr{A}, P[\cdot])$. The conditional probability
of event A given event B, denoted by $P[A \mid B]$, is defined by

$$P[A \mid B] = \frac{P[AB]}{P[B]} \quad \text{if } P[B] > 0, \tag{6}$$

and is left undefined if P[B] = 0. ////

Remark A formula that is evident from the definition is
$P[AB] = P[A \mid B]P[B] = P[B \mid A]P[A]$ if both P[A] and P[B] are nonzero. This
formula relates $P[A \mid B]$ to $P[B \mid A]$ in terms of the unconditional
probabilities P[A] and P[B]. ////


We might note that the above definition is compatible with the frequency
approach to probability, for if one observes a large number, say N, of
occurrences of a random experiment for which events A and B are defined, then
$P[A \mid B]$ represents the proportion of occurrences in which B occurred that A also
occurred, that is,

$$P[A \mid B] = \frac{N_{AB}}{N_B},$$

where $N_B$ denotes the number of occurrences of the event B in the N
occurrences of the random experiment and $N_{AB}$ denotes the number of occurrences
of the event $A \cap B$ in the N occurrences. Now $P[AB] = N_{AB}/N$, and $P[B] =
N_B/N$; so

$$\frac{P[AB]}{P[B]} = \frac{N_{AB}/N}{N_B/N} = \frac{N_{AB}}{N_B} = P[A \mid B],$$

consistent with our definition.


EXAMPLE 23 Let $\Omega$ be any finite sample space, $\mathscr{A}$ the collection of all subsets
of $\Omega$, and $P[\cdot]$ the equally likely probability function. Write $N = N(\Omega)$.
For events A and B,

$$P[A \mid B] = \frac{P[AB]}{P[B]} = \frac{N(AB)/N}{N(B)/N},$$

where, as usual, N(B) is the size of set B. So for any finite sample space
with equally likely sample points, the values of $P[A \mid B]$ are defined for any
two events A and B provided P[B] > 0. ////


EXAMPLE 24 Consider the experiment of tossing two coins. Let $\Omega =
\{(H, H), (H, T), (T, H), (T, T)\}$, and assume that each point is equally
likely. Find (i) the probability of two heads given a head on the first
coin and (ii) the probability of two heads given at least one head. Let
$A_1$ = {head on first coin} and $A_2$ = {head on second coin}; then the
probability of two heads given a head on the first coin is

$$P[A_1A_2 \mid A_1] = \frac{P[A_1A_2A_1]}{P[A_1]} = \frac{P[A_1A_2]}{P[A_1]} = \frac{1/4}{1/2} = \frac{1}{2}.$$

The probability of two heads given at least one head is

$$P[A_1A_2 \mid A_1 \cup A_2] = \frac{P[A_1A_2 \cap (A_1 \cup A_2)]}{P[A_1 \cup A_2]} = \frac{P[A_1A_2]}{P[A_1 \cup A_2]} = \frac{1/4}{3/4} = \frac{1}{3}.$$

We obtained numerical answers to these two questions, but to do so we
had to model the experiment; we assumed that the four sample points
were equally likely.
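
The two answers in Example 24 can be reproduced by applying Definition 18 to an explicit list of the four sample points. The code is a minimal sketch that assumes the equally likely model stated in the example; `prob` and `cond_prob` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))  # the four equally likely points

def prob(event):
    """P[A] = N(A)/N(Omega) for an event given as a predicate on points."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P[A | B] = P[AB]/P[B]; returns None where it is left undefined."""
    pb = prob(b)
    if pb == 0:
        return None
    return prob(lambda w: a(w) and b(w)) / pb

two_heads = lambda w: w == ("H", "H")
head_on_first = lambda w: w[0] == "H"
at_least_one_head = lambda w: "H" in w

p_given_first = cond_prob(two_heads, head_on_first)
p_given_any = cond_prob(two_heads, at_least_one_head)
print(p_given_first, p_given_any)
```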


When speaking of conditional probabilities we are conditioning on some
given event B; that is, we are assuming that the experiment has resulted in some
outcome in B. B, in effect, then becomes our “new” sample space. One
question that might be raised is: For given event B for which P[B] > 0, is $P[\cdot \mid B]$ a
probability function having $\mathscr{A}$ as its domain? In other words, does $P[\cdot \mid B]$
satisfy the three axioms? Note that:

(i) $P[A \mid B] = P[AB]/P[B] \ge 0$ for every $A \in \mathscr{A}$.

(ii) $P[\Omega \mid B] = P[\Omega B]/P[B] = P[B]/P[B] = 1$.

(iii) If $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathscr{A}$ and
$\bigcup_{i=1}^{\infty} A_i \in \mathscr{A}$, then

$$P\Big[\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big] = \frac{P\big[\big(\bigcup_{i=1}^{\infty} A_i\big)B\big]}{P[B]} = \frac{P\big[\bigcup_{i=1}^{\infty} A_iB\big]}{P[B]} = \frac{\sum_{i=1}^{\infty} P[A_iB]}{P[B]} = \sum_{i=1}^{\infty} P[A_i \mid B].$$

Hence, $P[\cdot \mid B]$ for given B satisfying P[B] > 0 is a probability function, which
justifies our calling it a conditional probability. $P[\cdot \mid B]$ also enjoys the same
properties as the unconditional probability. The theorems listed below are
patterned after those in Subsec. 3.4.


Properties of $P[\cdot \mid B]$ Assume that the probability space $(\Omega, \mathscr{A}, P[\cdot])$ is given,
and let $B \in \mathscr{A}$ satisfy P[B] > 0.

Theorem 22 $P[\varnothing \mid B] = 0$. ////

Theorem 23 If $A_1, \ldots, A_n$ are mutually exclusive events in $\mathscr{A}$, then

$$P[A_1 \cup \cdots \cup A_n \mid B] = \sum_{i=1}^{n} P[A_i \mid B]. \qquad ////$$

Theorem 24 If A is an event in $\mathscr{A}$, then

$$P[\bar{A} \mid B] = 1 - P[A \mid B]. \qquad ////$$

Theorem 25 If $A_1$ and $A_2 \in \mathscr{A}$, then

$$P[A_1 \mid B] = P[A_1A_2 \mid B] + P[A_1\bar{A}_2 \mid B]. \qquad ////$$

Theorem 26 For every two events $A_1$ and $A_2 \in \mathscr{A}$,

$$P[A_1 \cup A_2 \mid B] = P[A_1 \mid B] + P[A_2 \mid B] - P[A_1A_2 \mid B]. \qquad ////$$

Theorem 27 If $A_1$ and $A_2 \in \mathscr{A}$ and $A_1 \subset A_2$, then

$$P[A_1 \mid B] \le P[A_2 \mid B]. \qquad ////$$

Theorem 28 If $A_1, A_2, \ldots, A_n \in \mathscr{A}$, then

$$P[A_1 \cup A_2 \cup \cdots \cup A_n \mid B] \le \sum_{i=1}^{n} P[A_i \mid B]. \qquad ////$$

Proofs of the above theorems follow from known properties of $P[\cdot]$ and
are left as exercises.

There are a number of other useful formulas involving conditional
probabilities that we will state as theorems. These will be followed by examples.


Theorem 29 Theorem of total probabilities For a given probability
space $(\Omega, \mathscr{A}, P[\cdot])$, if $B_1, B_2, \ldots, B_n$ is a collection of mutually disjoint
events in $\mathscr{A}$ satisfying $\Omega = \bigcup_{j=1}^{n} B_j$ and $P[B_j] > 0$ for j = 1, ..., n, then
for every $A \in \mathscr{A}$,

$$P[A] = \sum_{j=1}^{n} P[A \mid B_j]P[B_j].$$

proof Note that $A = \bigcup_{j=1}^{n} AB_j$ and the $AB_j$'s are mutually disjoint;
hence

$$P[A] = P\Big[\bigcup_{j=1}^{n} AB_j\Big] = \sum_{j=1}^{n} P[AB_j] = \sum_{j=1}^{n} P[A \mid B_j]P[B_j]. \qquad ////$$

Corollary For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$ let $B \in \mathscr{A}$
satisfy 0 < P[B] < 1; then for every $A \in \mathscr{A}$

$$P[A] = P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}]. \qquad ////$$

Remark Theorem 29 remains true if $n = \infty$. ////

Theorem 29 (and its corollary) is particularly useful for those experiments
that have stages; that is, the experiment consists of performing first one thing
(first stage) and then another (second stage). Example 25 provides an example
of such an experiment; there, one first selects an urn and then selects a ball
from the selected urn. For such experiments, if $B_j$ is an event defined only in
terms of the first stage and A is an event defined in terms of the second stage,
then it may be easy to find $P[B_j]$; also, it may be easy to find $P[A \mid B_j]$, and then
Theorem 29 evaluates P[A] in terms of $P[B_j]$ and $P[A \mid B_j]$ for j = 1, ..., n. In
an experiment consisting of stages it is natural to condition on results of a first
stage.


Theorem 30 Bayes’ formula For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$,
if $B_1, B_2, \ldots, B_n$ is a collection of mutually disjoint events in $\mathscr{A}$ satisfying
$\Omega = \bigcup_{j=1}^{n} B_j$ and $P[B_j] > 0$ for j = 1, ..., n, then for every $A \in \mathscr{A}$ for which
P[A] > 0

$$P[B_k \mid A] = \frac{P[A \mid B_k]P[B_k]}{\sum_{j=1}^{n} P[A \mid B_j]P[B_j]}.$$

PROOF

$$P[B_k \mid A] = \frac{P[B_kA]}{P[A]} = \frac{P[A \mid B_k]P[B_k]}{\sum_{j=1}^{n} P[A \mid B_j]P[B_j]}$$

by using both the definition of conditional probability and the theorem of
total probabilities. ////

Corollary For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$ let A and $B \in \mathscr{A}$
satisfy P[A] > 0 and 0 < P[B] < 1; then

$$P[B \mid A] = \frac{P[A \mid B]P[B]}{P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}]}. \qquad ////$$

Remark Theorem 30 remains true if $n = \infty$. ////


As was the case with the theorem of total probabilities, Bayes’ formula is
also particularly useful for those experiments consisting of stages. If $B_j$,
j = 1, ..., n, is an event defined in terms of a first stage and A is an event defined
in terms of the whole experiment including a second stage, then asking for
$P[B_k \mid A]$ is in a sense backward; one is asking for the probability of an event
defined in terms of a first stage of the experiment conditioned on what happens
in a later stage of the experiment. The natural conditioning would be to
condition on what happens in the first stage of the experiment, and this is precisely
what Bayes’ formula does; it expresses $P[B_k \mid A]$ in terms of the natural
conditioning given by $P[A \mid B_j]$ and $P[B_j]$, j = 1, ..., n.

Theorem 31 Multiplication rule For a given probability space
$(\Omega, \mathscr{A}, P[\cdot])$, let $A_1, \ldots, A_n$ be events belonging to $\mathscr{A}$ for which
$P[A_1 \cdots A_{n-1}] > 0$; then

$$P[A_1A_2 \cdots A_n] = P[A_1]P[A_2 \mid A_1]P[A_3 \mid A_1A_2] \cdots P[A_n \mid A_1 \cdots A_{n-1}].$$

proof The proof can be attained by employing mathematical
induction and is left as an exercise. ////

As with the two previous theorems, the multiplication rule is primarily
useful for experiments defined in terms of stages. Suppose the experiment has
n stages and $A_j$ is an event defined in terms of stage j of the experiment; then
$P[A_j \mid A_1A_2 \cdots A_{j-1}]$ is the conditional probability of an event described in
terms of what happens on stage j conditioned on what happens on stages
1, 2, ..., j − 1. The multiplication rule gives $P[A_1A_2 \cdots A_n]$ in terms of
the natural conditional probabilities $P[A_j \mid A_1A_2 \cdots A_{j-1}]$ for j = 2, ..., n.


EXAMPLE 25 There are five urns, and they are numbered 1 to 5. Each 
urn contains 10 balls. Urn i has i defective balls and 10 — i nondefective 
balls, i= 1, 2, ..., 5. For instance, urn 3 has three defective balls and 
seven nondefective balls. Consider the following random experiment: 
First an urn is selected at random, and then a ball is selected at random 
from the selected urn. (The experimenter does not know which urn was 
selected.) Let us ask two questions: (i) What is the probability that a 
defective ball will be selected? (ii) If we have already selected the ball 
and noted that it is defective, what is the probability that it came from 
urn 5? 


solution Let A denote the event that a defective ball is selected and
$B_i$ the event that urn i is selected, i = 1, ..., 5. Note that $P[B_i] = \frac{1}{5}$,
i = 1, ..., 5, and $P[A \mid B_i] = i/10$, i = 1, ..., 5. Question (i) asks, What is
P[A]? Using the theorem of total probabilities, we have

$$P[A] = \sum_{i=1}^{5} P[A \mid B_i]P[B_i] = \sum_{i=1}^{5} \frac{i}{10} \cdot \frac{1}{5} = \frac{1}{50}\sum_{i=1}^{5} i = \frac{1}{50} \cdot \frac{5 \cdot 6}{2} = \frac{3}{10}.$$

Note that there is a total of 50 balls of which 15 are defective! Question
(ii) asks, What is $P[B_5 \mid A]$? Since urn 5 has more defective balls than
any of the other urns and we selected a defective ball, we suspect that
$P[B_5 \mid A] > P[B_i \mid A]$ for i = 1, 2, 3, or 4. In fact, we suspect
$P[B_5 \mid A] > P[B_4 \mid A] > \cdots > P[B_1 \mid A]$. Employing Bayes’ formula, we find

$$P[B_5 \mid A] = \frac{P[A \mid B_5]P[B_5]}{\sum_{i=1}^{5} P[A \mid B_i]P[B_i]} = \frac{\frac{5}{10} \cdot \frac{1}{5}}{\frac{3}{10}} = \frac{1}{3}.$$

Similarly,

$$P[B_k \mid A] = \frac{(k/10) \cdot \frac{1}{5}}{\frac{3}{10}} = \frac{k}{15}, \qquad k = 1, \ldots, 5,$$

substantiating our suspicion. Note that unconditionally all the $B_i$’s were
equally likely whereas, conditionally (conditioned on occurrence of event
A), they were not. Also, note that

$$\sum_{k=1}^{5} P[B_k \mid A] = \sum_{k=1}^{5} \frac{k}{15} = \frac{1}{15}\sum_{k=1}^{5} k = \frac{1}{15} \cdot \frac{5 \cdot 6}{2} = 1. \qquad ////$$
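
The two-stage calculation in Example 25 translates directly into a few lines of exact arithmetic: Theorem 29 gives P[A], and Theorem 30 then gives each posterior. This is a sketch; the dictionary names `priors`, `likelihoods`, and `posteriors` are assumptions used for illustration.

```python
from fractions import Fraction

urns = range(1, 6)
priors = {i: Fraction(1, 5) for i in urns}        # P[B_i]: urn chosen at random
likelihoods = {i: Fraction(i, 10) for i in urns}  # P[A | B_i]: urn i has i defectives

# Theorem of total probabilities: P[A] = sum over i of P[A | B_i] P[B_i].
p_defective = sum(likelihoods[i] * priors[i] for i in urns)

# Bayes' formula: P[B_k | A] = P[A | B_k] P[B_k] / P[A].
posteriors = {k: likelihoods[k] * priors[k] / p_defective for k in urns}

print(p_defective)
print(posteriors)
```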


EXAMPLE 26 Assume that a student is taking a multiple-choice test. On a
given question, the student either knows the answer, in which case he
answers it correctly, or he does not know the answer, in which case he
guesses hoping to guess the right answer. Assume that there are five
multiple-choice alternatives, as is often the case. The instructor is
confronted with this problem: Having observed that the student got the
correct answer, he wishes to know what is the probability that the student
knew the answer. Let p be the probability that the student will know the
answer and 1 − p the probability that the student guesses. Let us assume
that the probability that the student gets the right answer given that he
guesses is $\frac{1}{5}$. (This may not be a realistic assumption since even though the
student does not know the right answer, he often would know that certain
alternatives are wrong, in which case his probability of guessing correctly
should be better than $\frac{1}{5}$.) Let A denote the event that the student got the
right answer and B denote the event that the student knew the right answer.
We are seeking $P[B \mid A]$. Using Bayes’ formula, we have

$$P[B \mid A] = \frac{P[A \mid B]P[B]}{P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}]} = \frac{1 \cdot p}{1 \cdot p + \frac{1}{5}(1-p)}.$$

Note that

$$\frac{p}{p + \frac{1}{5}(1-p)} \ge p.$$

////
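
To see how the posterior in Example 26 behaves as p varies, the formula can be wrapped in a small function. This is a sketch under the example's own assumption that a guess is correct with probability 1/5; the function name and the `alternatives` parameter are hypothetical.

```python
from fractions import Fraction

def prob_knew_given_correct(p, alternatives=5):
    """P[B | A] from Bayes' formula, with P[A | B] = 1 and
    P[A | not B] = 1/alternatives (pure guessing assumed)."""
    p = Fraction(p)
    guess = Fraction(1, alternatives)
    return p / (p + guess * (1 - p))

for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    print(p, prob_knew_given_correct(p))
```

For p = 1/2 the posterior is 5/6, which illustrates the inequality noted at the end of the example: a correct answer never lowers the probability that the student knew it.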
EXAMPLE 27 An urn contains ten balls of which three are black and seven
are white. The following game is played: At each trial a ball is selected
at random, its color is noted, and it is replaced along with two additional
balls of the same color. What is the probability that a black ball is
selected in each of the first three trials? Let $B_i$ denote the event that a
black ball is selected on the ith trial. We are seeking $P[B_1B_2B_3]$. By the
multiplication rule,

$$P[B_1B_2B_3] = P[B_1]P[B_2 \mid B_1]P[B_3 \mid B_1B_2] = \frac{3}{10} \cdot \frac{5}{12} \cdot \frac{7}{14} = \frac{1}{16}. \qquad ////$$
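
The multiplication rule in Example 27 amounts to multiplying the stage-by-stage conditional probabilities while updating the urn. A minimal sketch follows; the function name `prob_all_black` and its parameters are assumptions made for illustration.

```python
from fractions import Fraction

def prob_all_black(black, white, added, trials):
    """Probability of a black ball on every one of the first `trials` draws,
    when each drawn ball is returned together with `added` more of its color."""
    prob = Fraction(1)
    for _ in range(trials):
        # Conditional probability of black given black on all earlier trials.
        prob *= Fraction(black, black + white)
        black += added  # the urn gains `added` black balls after a black draw
    return prob

print(prob_all_black(black=3, white=7, added=2, trials=3))
```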


EXAMPLE 28 Suppose an urn contains M balls of which K are black and
M − K are white. A sample of size n is drawn. Find the probability
that the jth ball drawn is black given that the sample contains k black
balls. (We intuitively expect the answer to be k/n.) We have to
consider sampling (i) with replacement and (ii) without replacement.


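solution Let $A_k$ denote the event that the sample contains exactly
k black balls and $B_j$ denote the event that the jth ball drawn is black.
We seek $P[B_j \mid A_k]$. Consider (i) first.

$$P[A_k] = \binom{n}{k}\frac{K^k(M-K)^{n-k}}{M^n} \quad \text{and} \quad P[A_k \mid B_j] = \binom{n-1}{k-1}\frac{K^{k-1}(M-K)^{n-k}}{M^{n-1}}$$

by Eq. (3) of Subsec. 3.5. Since the balls are replaced, $P[B_j] = K/M$ for
any j. Hence,

$$P[B_j \mid A_k] = \frac{P[A_k \mid B_j]P[B_j]}{P[A_k]} = \frac{\binom{n-1}{k-1}\dfrac{K^{k-1}(M-K)^{n-k}}{M^{n-1}} \cdot \dfrac{K}{M}}{\binom{n}{k}\dfrac{K^k(M-K)^{n-k}}{M^n}} = \frac{\binom{n-1}{k-1}}{\binom{n}{k}} = \frac{k}{n}.$$

For case (ii),

$$P[A_k] = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}} \quad \text{and} \quad P[A_k \mid B_j] = \frac{\binom{K-1}{k-1}\binom{M-K}{n-k}}{\binom{M-1}{n-1}}$$

by Eq. (5) of Subsec. 3.5. $P[B_j] = \sum_{i=0}^{j-1} P[B_j \mid C_i]P[C_i]$, where $C_i$ denotes
the event of exactly i black balls in the first j − 1 draws. Note that

$$P[C_i] = \frac{\binom{K}{i}\binom{M-K}{j-1-i}}{\binom{M}{j-1}}$$

and

$$P[B_j \mid C_i] = \frac{K-i}{M-j+1},$$

and so

$$P[B_j] = \sum_{i=0}^{j-1} \frac{K-i}{M-j+1} \cdot \frac{\binom{K}{i}\binom{M-K}{j-1-i}}{\binom{M}{j-1}} = \frac{K}{M}\sum_{i=0}^{j-1} \frac{\binom{K-1}{i}\binom{M-K}{j-1-i}}{\binom{M-1}{j-1}} = \frac{K}{M}.$$

Finally,

$$P[B_j \mid A_k] = \frac{P[A_k \mid B_j]P[B_j]}{P[A_k]} = \frac{\dfrac{\binom{K-1}{k-1}\binom{M-K}{n-k}}{\binom{M-1}{n-1}} \cdot \dfrac{K}{M}}{\dfrac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}} = \frac{k}{n}.$$

Thus we obtain the same answer under either method of sampling. ////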
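
The conclusion of Example 28 can be checked by counting ordered samples directly, since all n-tuples are equally likely under either sampling scheme. The sketch assumes a small urn (M = 5, K = 2, n = 3); these values and the helper name `cond_prob_jth_black` are illustrative only.

```python
from fractions import Fraction
from itertools import permutations, product

M, K, n = 5, 2, 3  # assumed small urn; balls 1..K are the black ones
balls = range(1, M + 1)

def cond_prob_jth_black(samples, j, k):
    """P[B_j | A_k] = N(A_k B_j)/N(A_k) over equally likely ordered samples."""
    in_A = [s for s in samples if sum(z <= K for z in s) == k]
    in_AB = [s for s in in_A if s[j - 1] <= K]
    return Fraction(len(in_AB), len(in_A))

with_repl = list(product(balls, repeat=n))
without_repl = list(permutations(balls, n))

for k in (1, 2):  # values of k for which A_k is nonempty in both schemes
    for j in range(1, n + 1):
        print(k, j,
              cond_prob_jth_black(with_repl, j, k),
              cond_prob_jth_black(without_repl, j, k))
```

Every printed probability should equal k/n, independent of the draw index j and of the sampling method.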


Independence of events If $P[A \mid B]$ does not depend on event B, that is,
$P[A \mid B] = P[A]$, then it would seem natural to say that event A is independent
of event B. This is given in the following definition.

Definition 19 Independent events For a given probability space
$(\Omega, \mathscr{A}, P[\cdot])$, let A and B be two events in $\mathscr{A}$. Events A and B are
defined to be independent if and only if any one of the following conditions
is satisfied:

(i) $P[AB] = P[A]P[B]$.

(ii) $P[A \mid B] = P[A]$ if P[B] > 0.

(iii) $P[B \mid A] = P[B]$ if P[A] > 0. ////

Remark Some authors use “statistically independent,” or “stochastically
independent,” instead of “independent.” ////

To argue the equivalence of the above three conditions, it suffices to show
that (i) implies (ii), (ii) implies (iii), and (iii) implies (i). If $P[AB] = P[A]P[B]$,
then $P[A \mid B] = P[AB]/P[B] = P[A]P[B]/P[B] = P[A]$ for P[B] > 0; so (i) implies
(ii). If $P[A \mid B] = P[A]$, then $P[B \mid A] = P[A \mid B]P[B]/P[A] = P[A]P[B]/P[A] =
P[B]$ for P[A] > 0 and P[B] > 0; so (ii) implies (iii). And if $P[B \mid A] = P[B]$,
then $P[AB] = P[B \mid A]P[A] = P[B]P[A]$ for P[A] > 0. Clearly $P[AB] =
P[A]P[B]$ if P[A] = 0 or P[B] = 0.

EXAMPLE 29 Consider the experiment of tossing two dice. Let A denote the
event of an odd total, B the event of an ace on the first die, and C the
event of a total of seven. We pose three problems:

(i) Are A and B independent?

(ii) Are A and C independent?

(iii) Are B and C independent?

We obtain $P[A \mid B] = \frac{1}{2} = P[A]$, $P[A \mid C] = 1 \ne P[A] = \frac{1}{2}$, and
$P[C \mid B] = \frac{1}{6} = P[C]$; so A and B are independent, A is not independent of C,
and B and C are independent. ////
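
The three answers in Example 29 can be verified with condition (i) of Definition 19 by enumerating the 36 outcomes. The code assumes the usual equally likely model for two fair dice, which the example leaves implicit; `prob` and `independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes (assumed)

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(a, b):
    """Condition (i) of Definition 19: P[AB] = P[A] P[B]."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)

A = lambda w: (w[0] + w[1]) % 2 == 1  # odd total
B = lambda w: w[0] == 1               # ace on the first die
C = lambda w: w[0] + w[1] == 7        # total of seven

print(independent(A, B), independent(A, C), independent(B, C))
```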


The property of independence of two events A and B and the property that
A and B are mutually exclusive are distinct, though related, properties. For
example, two mutually exclusive events A and B are independent if and only if
$P[A]P[B] = 0$, which is true if and only if either A or B has zero probability.
Or if $P[A] \ne 0$ and $P[B] \ne 0$, then A and B independent implies that they are
not mutually exclusive, and A and B mutually exclusive implies that they are not
independent. Independence of A and B implies independence of other events
as well.

Theorem 32 If A and B are two independent events defined on a given
probability space $(\Omega, \mathscr{A}, P[\cdot])$, then A and $\bar{B}$ are independent, $\bar{A}$ and B
are independent, and $\bar{A}$ and $\bar{B}$ are independent.

PROOF

$$P[A\bar{B}] = P[A] - P[AB] = P[A] - P[A]P[B] = P[A](1 - P[B]) = P[A]P[\bar{B}].$$

Similarly for the others. ////

The notion of independent events may be extended to more than two 
events. 


Definition 20 Independence of several events For a given probability
space $(\Omega, \mathscr{A}, P[\cdot])$, let $A_1, A_2, \ldots, A_n$ be n events in $\mathscr{A}$. Events $A_1,
A_2, \ldots, A_n$ are defined to be independent if and only if

$$P[A_iA_j] = P[A_i]P[A_j] \quad \text{for } i \ne j$$

$$P[A_iA_jA_k] = P[A_i]P[A_j]P[A_k] \quad \text{for } i \ne j,\ j \ne k,\ i \ne k$$

$$\vdots$$

$$P\Big[\bigcap_{i=1}^{n} A_i\Big] = \prod_{i=1}^{n} P[A_i]. \qquad ////$$

One might inquire whether all the above conditions are required in the
definition. For instance, does $P[A_1A_2A_3] = P[A_1]P[A_2]P[A_3]$ imply
$P[A_1A_2] = P[A_1]P[A_2]$? Obviously not, since $P[A_1A_2A_3] = P[A_1]P[A_2]P[A_3]$ if
$P[A_3] = 0$, but $P[A_1A_2] \ne P[A_1]P[A_2]$ if $A_1$ and $A_2$ are not independent. Or does
pairwise independence imply independence? Again the answer is negative,
as the following example shows.


EXAMPLE 30 Pairwise independence does not imply independence. Let $A_1$
denote the event of an odd face on the first die, $A_2$ the event of an odd face
on the second die, and $A_3$ the event of an odd total in the random
experiment that consists of tossing two dice. $P[A_1]P[A_2] = \frac{1}{2} \cdot \frac{1}{2} = P[A_1A_2]$,
$P[A_1]P[A_3] = \frac{1}{2} \cdot \frac{1}{2} = P[A_3 \mid A_1]P[A_1] = P[A_1A_3]$, and $P[A_2A_3] = \frac{1}{4} =
P[A_2]P[A_3]$; so $A_1$, $A_2$, and $A_3$ are pairwise independent. However
$P[A_1A_2A_3] = 0 \ne \frac{1}{8} = P[A_1]P[A_2]P[A_3]$; so $A_1$, $A_2$, and $A_3$ are not
independent. ////
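
Example 30 can be checked mechanically by testing the product condition of Definition 20 on every subcollection of two or more events. The sketch assumes 36 equally likely outcomes for the two dice; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

omega = list(product(range(1, 7), repeat=2))  # equally likely outcomes (assumed)

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def product_rule_holds(group):
    """True when P[intersection of group] equals the product of the P[A_i]."""
    joint = prob(lambda w: all(e(w) for e in group))
    return joint == prod(prob(e) for e in group)

def pairwise_independent(events):
    return all(product_rule_holds(pair) for pair in combinations(events, 2))

def independent(events):
    """Definition 20: the product rule for every subcollection of size >= 2."""
    return all(product_rule_holds(group)
               for r in range(2, len(events) + 1)
               for group in combinations(events, r))

A1 = lambda w: w[0] % 2 == 1           # odd face on the first die
A2 = lambda w: w[1] % 2 == 1           # odd face on the second die
A3 = lambda w: (w[0] + w[1]) % 2 == 1  # odd total

print(pairwise_independent([A1, A2, A3]), independent([A1, A2, A3]))
```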

In one sense, independence and conditional probability are each used to find
the same thing, namely, P[AB], for $P[AB] = P[A]P[B]$ under independence and
$P[AB] = P[A \mid B]P[B]$ under nonindependence. The nature of the events A and
B may make calculations of P[A], P[B], and possibly $P[A \mid B]$ easy, but direct
calculation of P[AB] difficult, in which case our formulas for independence or
conditional probability would allow us to avoid the difficult direct calculation
of P[AB]. We might note that $P[AB] = P[A \mid B]P[B]$ is valid whether or not A
is independent of B provided that $P[A \mid B]$ is defined.

The definition of independence is used not only to check if two given events 
are independent but also to model experiments. For instance, for a given 
experiment the nature of the events A and B might be such that we are willing 
to assume that A and B are independent; then the definition of independence gives 
the probability of the event A n B in terms of P[A] and P[B]. Similarly for 
more than two events. 


EXAMPLE 31 Consider the experiment of sampling with replacement from
an urn containing M balls of which K are black and M − K white. Since
balls are being replaced after each draw, it seems reasonable to assume that
the outcome of the second draw is independent of the outcome of the
first. Then P[two blacks in first two draws] =

P[black on first draw]P[black on second draw] = $(K/M)^2$. ////

PROBLEMS

To solve some of these problems it may be necessary to make certain assumptions, 
such as sample points are equally likely, or trials are independent, etc., when such 
assumptions are not explicitly stated. Some of the more difficult problems, or those 
that require special knowledge, are marked with an *. 

1 One urn contains one black ball and one gold ball. A second urn contains one 
white and one gold ball. One ball is selected at random from each urn. 

(a) Exhibit a sample space for this experiment. 

(b) Exhibit the event space.

(c) What is the probability that both balls will be of the same color?

(d) What is the probability that one ball will be green?

2 One urn contains three red balls, two white balls, and one blue ball. A second 
urn contains one red ball, two white balls, and three blue balls. 

(a) One ball is selected at random from each urn. 

(i) Describe a sample space for this experiment. 

(ii) Find the probability that both balls will be of the same color. 

(iii) Is the probability that both balls will be red greater than the
probability that both will be white?

(b) The balls in the two urns are mixed together in a single urn, and then a sample
of three is drawn. Find the probability that all three colors are represented,
when (i) sampling with replacement and (ii) without replacement.

3 If A and B are disjoint events, P[A] = .5, and $P[A \cup B] = .6$, what is P[B]?

4 An urn contains five balls numbered 1 to 5 of which the first three are black and
the last two are gold. A sample of size 2 is drawn with replacement. Let $B_1$
denote the event that the first ball drawn is black and $B_2$ denote the event that the
second ball drawn is black.

(a) Describe a sample space for the experiment, and exhibit the events $B_1$, $B_2$,
and $B_1B_2$.

(b) Find $P[B_1]$, $P[B_2]$, and $P[B_1B_2]$.

(c) Repeat parts (a) and (b) for sampling without replacement.

5 A car with six spark plugs is known to have two malfunctioning spark plugs. 
If two plugs are pulled at random, what is the probability of getting both of 
the malfunctioning plugs ? 

6 In an assembly-line operation, ( of the items being produced are defective. If 
three items are picked at random and tested, what is the probability: 

(a) That exactly one of them will be defective? 

(b) That at least one of them will be defective?

7 In a certain game a participant is allowed three attempts at scoring a hit. In the 
three attempts he must alternate which hand is used; thus he has two possible 
strategies: right hand, left hand, right hand; or left hand, right hand, left hand. 
His chance of scoring a hit with his right hand is .8, while it is only .5 with his 
left hand. If he is successful at the game provided that he scores at least two hits 
in a row, what strategy gives the better chance of success? Answer the same 





question if .8 is replaced by $p_1$ and .5 by $p_2$. Does your answer depend on $p_1$
and $p_2$?

8 (a) Suppose that A and B are two equally strong teams. Is it more probable
that A will beat B in three games out of four or in five games out of seven?

(b) Suppose now that the probability that A beats B in an individual game is p.
Answer part (a). Does your answer depend on p?

9 If P[A] = J and P[B] = can A and B be disjoint ? Explain. 

10 Prove or disprove: If P[A] = P[B] = p, then $P[AB] \le p^2$.

11 Prove or disprove: If P[A] = P[B], then A = B.

12 Prove or disprove: If P[A] = 0, then $A = \varnothing$.

13 Prove or disprove: If P[A] = 0, then P[AB] = 0.

14 Prove: If $P[\bar{A}] = \alpha$ and $P[\bar{B}] = \beta$, then $P[AB] \ge 1 - \alpha - \beta$.

15 Prove properties (i) to (iv) of indicator functions. 

16 Prove the more general statement in Theorem 19. 

17 Exhibit (if such exists) a probability space, denoted by (D, j/, /*[•]), which satisfies *4 
the following. For A y and A 2 members ofs/, if P[A X ] = P[A 2 ], then A l = A 2 . 

18 Four drinkers (say I, II, III, and IV) are to rank three different brands of beer
(say A, B, and C) in a blindfold test. Each drinker ranks the three beers as 1
(for the beer he likes best), 2, and 3, and then the assigned ranks of each brand 
of beer are summed. Assume that the drinkers really cannot discriminate between 
beers so that each is assigning his rankings at random. 

(a) What is the probability that beer A will receive a total score of 4? 

(b) What is the probability that some beer will receive a total score of 4? 

(c) What is the probability that some beer will receive a total score of 5 or less? 

19 The following are three of the classical problems in probability. 

(a) Compare the probability of a total of 9 with a total of 10 when three fair
dice are tossed once (Galileo and Duke of Tuscany).

(b) Compare the probability of at least one 6 in 4 tosses of a fair die with the
probability of at least one double-6 in 24 tosses of two fair dice (Chevalier
de Méré).

(c) Compare the probability of at least one 6 when six dice are rolled with the
probability of at least two 6s when twelve dice are rolled (Pepys to Newton).

20 A seller has a dozen small electric motors, two of which are faulty. A customer is 
interested in the dozen motors. The seller can crate the motors with all twelve in 
one box or with six in each of two boxes; he knows that the customer will inspect 
two of the twelve motors if they are all crated in one box and one motor from each 
of the two smaller boxes if they are crated six each to two smaller boxes. He 
has three strategies in his attempt to sell the faulty motors: (i) crate all twelve 
in one box; (ii) put one faulty motor in each of the two smaller boxes; or (iii) put 
both of the faulty motors in one of the smaller boxes and no faulty motors in the 
other. What is the probability that the customer will not inspect a faulty motor 
under each of the three strategies?

21 A sample of five objects is drawn from a larger population of N objects (N > 5).
Let $N_w$ or $N_{wo}$ denote the number of different samples that could be drawn
depending, respectively, on whether sampling is done with or without replacement.
Give the values for $N_w$ and $N_{wo}$. Show that when N is very large, these two values
are approximately equal in the sense that their ratio is close to 1 but not in the
sense that their difference is close to 0.

22 Out of a group of 25 persons, what is the probability that all 25 will have different 
birthdays? (Assume a 365-day year and that all days are equally likely.) 

23 A bridge player knows that his two opponents have exactly five hearts between 
the two of them. Each opponent has thirteen cards. What is the probability 
that there is a three-two split on the hearts (that is, one player has three hearts 
and the other two)? 

24 (a) If r balls are randomly placed into n urns (each ball having probability 1/n
of going into the first urn), what is the probability that the first urn will
contain exactly k balls?

(b) Let $n \to \infty$ and $r \to \infty$ while r/n = m remains constant. Show that the
probability you calculated approaches $e^{-m} m^k / k!$.

25 A biased coin has probability p of landing heads. Ace, Bones, and Clod toss the
coin successively, Ace tossing first, until a head occurs. The person who tosses
the first head wins. Find the probability of winning for each.

*26 It is told that in certain rural areas of Russia marital fortunes were once told in the
following way: A girl would hold six strings in her hand with the ends protruding
above and below; a friend would tie together the six upper ends in pairs and then
tie together the six lower ends in pairs. If it turned out that the friend had tied
the six strings into at least one ring, this was supposed to indicate that the girl
would get married within a year. What is the probability that a single ring will
be formed when the strings are tied at random? What is the probability that at
least one ring will be formed? Generalize the problem to 2n strings.

27 Mr. Bandit, a well-known rancher and not so well-known part-time cattle rustler,
has twenty head of cattle ready for market. Sixteen of these cattle are his own
and consequently bear his own brand. The other four bear foreign brands. Mr.
Bandit knows that the brand inspector at the market place checks the brands of
20 percent of the cattle in any shipment. He has two trucks, one which will haul
all twenty cattle at once and the other that will haul ten at a time. Mr. Bandit
feels that he has four different strategies to follow in his attempt to market the
cattle without getting caught. The first is to sell all twenty head at once; the
others are to sell ten head on two different occasions, putting all four stolen cattle
in one set of ten, or three head in one shipment and one in the other, or two head in
each of the shipments of ten. Which strategy will minimize Mr. Bandit's
probability of getting caught, and what is his probability of getting caught under each
strategy?

28 Show that the formula of Eq. (4) is the same as the formula of Eq. (5).

29 Prove Theorem 31.

30 Either prove or disprove each of the following (you may assume that none of the 
events has zero probability): 

(a) If P[A | B] > P[A], then P[B | A] > P[B].

(b) If P[A] > P[B], then P[A | C] > P[B | C].

31 A certain computer program will operate using either of two subroutines, say A 
and B, depending on the problem; experience has shown that subroutine A will be 
used 40 percent of the time and B will be used 60 percent of the time. If A is 
used, then there is a 75 percent probability that the program will run before its 
time limit is exceeded; and if B is used, there is a 50 percent chance that it will do 
so. What is the probability that the program will run without exceeding the time 
limit ? 

32 Suppose that it is known that a fraction .001 of the people in a town have
tuberculosis (TB). A tuberculosis test is given with the following properties: If the
person does have TB, the test will indicate it with a probability .999. If he does 
not have TB, then there is a probability .002 that the test will erroneously indicate 
that he does. For one randomly selected person, the test shows that he has 
TB. What is the probability that he really does? 

*33 Consider the experiment of tossing two fair regular tetrahedra (a polyhedron with 
four faces numbered 1 to 4) and noting the numbers on the downturned faces. 

(a) Give three proper events (an event A is proper if 0 <P[A] < 1) which are 
independent (if such exist). 

(b) Give three proper events which are pairwise independent but not independent 
(if such exist). 

(c) Give four proper events which are independent (if such exist). 

34 Prove or disprove: 

(a) If A and B are independent events, then P[AB | C] = P[A | C]P[B | C].

(b) If P[A | B] = P[B], then A and B are independent.

35 Prove or disprove:

(a) If P[A | B] > P[A], then P[B | A] > P[B].

(b) If $P[B \mid A] = P[B \mid \bar{A}]$, then A and B are independent.

(c) If a = P[A] and b = P[B], then $P[A \mid B] \ge (a + b - 1)/b$.

36 Consider an urn containing 10 balls of which 5 are black. Choose an integer n 
at random from the set 1, 2, 3, 4, 5, 6, and then choose a sample of size n without
replacement from the urn. Find the probability that all the balls in the sample 
will be black. 

37 A die is thrown as long as necessary for a 6 to turn up. Given that the 6 does not 
turn up at the first throw, what is the probability that more than four throws will 
be necessary? 

38 Die A has four red and two blue faces, and die B has two red and four blue faces. 
The following game is played: First a coin is tossed once. If it falls heads, the 
game continues by repeatedly throwing die A ; if it falls tails, die B is repeatedly 
tossed.

(a) Show that the probability of red at any throw is 1/2.

(b) If the first two throws of the die resulted in red, what is the probability of red 
at the third throw ? 

(c) If red turns up at the first n throws, what is the probability that die A is 
being used ? 

39 Urn A contains two white and two black balls; urn B contains three white and 
two black balls. One ball is transferred from A to B; one ball is then drawn
from B and turns out to be white. What is the probability that the transferred 
ball was white? 

*40 It is known that each of four people A, B, C, and D tells the truth in a given 
instance with probability £. Suppose that A makes a statement, and then D says 
that C says that B says that A was telling the truth. What is the probability 
that A was actually telling the truth? 

41 In a T maze, a laboratory animal is given a choice of going to the left and getting 
food or going to the right and receiving a mild electric shock. Assume that 
before any conditioning (in trial number 1) animals are equally likely to go to the 
left or to the right. After having received food on a particular trial, the
probabilities of going to the left and right become .6 and .4, respectively, on the
following trial. However, after receiving a shock on a particular trial, the probabilities
of going to the left and right on the next trial are .8 and .2, respectively. What 
is the probability that the animal will turn left on trial number 2? On trial 
number 3 ? 

*42 In a breeding experiment, the male parent is known to have either two dominant 
genes (symbolized by AA) or one dominant and one recessive (Aa). These two 
cases are equally likely. The female parent is known to have two recessive genes 
(aa). Since the offspring gets one gene from each parent, it will be either Aa or 
aa, and it will be possible to say with certainty which one. 

(a) If we suppose one offspring is Aa, what is the probability that the male
parent is AA?

(b) If we suppose two offspring are both Aa, what is the probability that the
male parent is AA?

(c) If one offspring is aa, what is the probability that the male parent is Aa?

43 The constitution of two urns is


Urn I: three black, two white
Urn II: four black, six white


A draw is made by selecting an urn by a process which assigns probability p to the 
selection of urn I and probability 1 - p to the selection of urn II. The selection 
of a ball from either urn is by a process which assigns equal probability to all 
balls in the urn. What value of p makes the probability of obtaining a black 
ball the same as if a single draw were made from an urn with seven black and 
eight white balls (all balls equally probable of being drawn)?

44 Given P[A] = .5 and $P[A \cup B] = .6$, find P[B] if:

(a) A and B are mutually exclusive.

(b) A and B are independent.

(c) P[A | B] = .4.

45 Three fair dice are thrown once. Given that no two show the same face:

(a) What is the probability that the sum of the faces is 7?

(b) What is the probability that one is an ace?

46 Given that P[A] > 0 and P[B] > 0, prove or disprove:

(a) If P[A] = P[B], then P[A | B] = P[B | A].

(b) If P[A | B] = P[B | A], then P[A] = P[B].

47 Five percent of the people have high blood pressure. Of the people with high 
blood pressure, 75 percent drink alcohol; whereas, only 50 percent of the people 
without high blood pressure drink alcohol. What percent of the drinkers have 
high blood pressure? 

48 A distributor of watermelon seeds determined from extensive tests that 4 percent 
of a large batch of seeds will not germinate. He sells the seeds in packages of 50 
seeds and guarantees at least 90 percent germination. What is the probability 
that a given package will violate the guarantee? 

49 If A and B are independent, P[A] = I, and P[B] = J, find P[A u B], 

50 Mr. Stoneguy, a wealthy diamond dealer, decides to reward his son by allowing 
him to select one of two boxes. Each box contains three stones. In one box two 
of the stones are real diamonds, and the other is a worthless imitation; and in the 
other box one is a real diamond, and the other two worthless imitations. If the 
son were to choose randomly between the two boxes, his chance of getting two 
real diamonds would be 1/2. Mr. Stoneguy, being a sporting type, allows his
son to draw one stone from one of the boxes and to examine it to see if it is a real 
diamond. The son decides to take the box that the stone he tested came from if 
the tested stone is real and to take the other box otherwise. Now what is the 
probability that the son will get two real diamonds? 

51 If P[A] =P[B\=P[B\ A] = i, are A and B independent ? 

52 If A and B are independent and P[A] = P[B ] = i, what is P[AB u AB]? 

53 If P[B] = P[A | B] = P[C\ AB] = J, what is P[ABC]? 

54 If A and B are independent and P[A] = P[B\A] = £, what is P[A u B]? 

55 Suppose B u B 2 , and B 2 are mutually exclusive. If P[Bj] = ' and P[A\B } ] =jj 6 
for j = 1, 2, 3, what is P[A]1 

*56 The game of craps is played by letting the thrower toss two dice until he either wins 
or loses. The thrower wins on the first toss if he gets a total of 7 or 11; he loses 
on the first toss if he gets a total of 2, 3, or 12. If he gets any other total on his 
first toss, that total is called his point. He then tosses the dice repeatedly until he 
obtains a total of 7 or his point. He wins if he gets his point and loses if he gets a 
total of 7. What is the thrower’s probability of winning? 

57 In a dice game a player casts a pair of dice twice. He wins if the two totals 
thrown do not differ by more than 2 with the following exceptions: If he gets a
3 on the first throw, he must produce a 4 on the second throw; if he gets an 11 on
the first throw, he must produce a 10 on the second throw. What is his
probability of winning?

58 Assume that the conditional probability that a child born to a couple will be
male is $1/2 + m\epsilon_1 - f\epsilon_2$, where $\epsilon_1$ and $\epsilon_2$ are certain small constants, m is the
number of male children already born to the couple, and f is the number of female
children already born to the couple.

(a) What is the probability that the third child will be a boy given that the first
two are girls?

(b) Find the probability that the first three children will be all boys.

(c) Find the probability of at least one boy in the first three children.

(Your answers will be expressed in terms of $\epsilon_1$ and $\epsilon_2$.)

*59 A network of switches a, b, c, and d is connected across the power lines A and B
as shown in the sketch. Assume that the switches operate electrically and have
independent operating mechanisms. All are controlled simultaneously by the
same impulses; that is, it is intended that on an impulse all switches shall close
simultaneously. But each switch has a probability p of failure (it will not close
when it should).



(a) What is the probability that the circuit from A to B will fail to close? 

(b) If a line is added on at e, as indicated in the sketch, what is the probability 
that the circuit from A to B will fail to close? 

(c) If a line and switch are added at e, what is the probability that the circuit from
A to B will fail to close?

60 Let $B_1, B_2, \ldots, B_n$ be mutually disjoint, and let $B = \bigcup_{j=1}^{n} B_j$. Suppose $P[B_j] > 0$
and $P[A \mid B_j] = p$ for $j = 1, \ldots, n$. Show that $P[A \mid B] = p$.

61 In a laboratory experiment, an attempt is made to teach an animal to turn right 
in a maze. To aid in the teaching, the animal is rewarded if it turns right on a 
given trial and punished if it turns left. On the first trial the animal is just as 
likely to turn right as left. If on a particular trial the animal was rewarded, his 
probability of turning right on the next trial is $p_1 > 1/2$, and if on a given trial the
animal was punished, his probability of turning right on the next trial is $p_2 > p_1$.

(a) What is the probability that the animal will turn right on the third trial ? 

(b) What is the probability that the animal will turn right on the third trial, 
given that he turned right on the first trial?

*62 You are to play ticktacktoe with an opponent who on his turn makes his mark by
selecting a space at random from the unfilled spaces. You get to mark first. 
Where should you mark to maximize your chance of winning, and what is your 
probability of winning? (Note that your opponent cannot win, he can only 
tie.) 

63 Urns I and II each contain two white and two black balls. One ball is selected 
from urn I and transferred to urn II; then one ball is drawn from urn II and turns 
out to be white. What is the probability that the transferred ball was white? 

64 Two regular tetrahedra with faces numbered 1 to 4 are tossed repeatedly until a 
total of 5 appears on the down faces. What is the probability that more than two 
tosses are required ? 

65 Given P[A] = .5 and $P[A \cup B] = .7$:

(a) Find P[B] if A and B are independent.

(b) Find P[B] if A and B are mutually exclusive.

(c) Find P[B] if P[A | B] = .5.

66 A single die is tossed; then n coins are tossed, where n is the number shown on the 
die. What is the probability of exactly two heads? 

*67 In simple Mendelian inheritance, a physical characteristic of a plant or animal is
determined by a single pair of genes. The color of peas is an example. Let y and
g represent yellow and green; peas will be green if the plant has the color-gene
pair (g, g); they will be yellow if the color-gene pair is (y, y) or (y, g). In view of
this last combination, yellow is said to be dominant to green. Progeny get one
gene from each parent and are equally likely to get either gene from each parent’s
pair. If (y, y) peas are crossed with (g, g) peas, all the resulting peas will be (y, g)
and yellow because of dominance. If (y, g) peas are crossed with (g, g) peas, the
probability is .5 that the resulting peas will be yellow and is .5 that they will be
green. In a large number of such crosses one would expect about half the
resulting peas to be yellow, the remainder to be green. In crosses between (y, g) and
(y, g) peas, what proportion would be expected to be yellow? What proportion
of the yellow peas would be expected to be (y, y)?

*68 Peas may be smooth or wrinkled, and this is a simple Mendelian character.
Smooth is dominant to wrinkled so that (s, s) and (s, w) peas are smooth while
(w, w) peas are wrinkled. If (y, g) (s, w) peas are crossed with (g, g) (w, w) peas,
what are the possible outcomes, and what are their associated probabilities? For
the (y, g) (s, w) by (g, g) (s, w) cross? For the (y, g) (s, w) by (y, g) (s, w) cross?

69 Prove the two unproven parts of Theorem 32. 

70 A supplier of a certain testing device claims that his device has high reliability
inasmuch as $P[A \mid B] = P[\bar{A} \mid \bar{B}] = .95$, where A = {device indicates component is
faulty} and B = {component is faulty}. You hope to use the device to locate the
faulty components in a large batch of components of which 5 percent are faulty.

(a) What is $P[B \mid A]$?

(b) Suppose you want $P[B \mid A] = .9$. Let $p = P[A \mid B] = P[\bar{A} \mid \bar{B}]$. How large
does p have to be?


RANDOM VARIABLES, DISTRIBUTION
FUNCTIONS, AND EXPECTATION


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concepts of random variable, 
distribution and density functions, and expectation. It is primarily a “ definitions- 
and-their-understanding” chapter; although some other results are given as 
well. The definitions of random variable and cumulative distribution function 
are given in Sec. 2, and the definitions of density functions are given in Sec. 3. 
These definitions are easily stated since each is just a particular function. The 
cumulative distribution function exists and is defined for each random variable; 
whereas, a density function is defined only for particular random variables. 
Expectations of functions of random variables are the underlying concept of all
of Sec. 4. This concept is introduced by considering two particular, yet
extremely important, expectations. These two are the mean and variance,
defined in Subsecs. 4.1 and 4.2, respectively. Subsection 4.3 is devoted to the
definition and properties of expectation of a function of a random variable.
A very important result in the chapter appears in Subsec. 4.4 as the Chebyshev
inequality and a generalization thereof. It is nice to be able to attain so famous
a result so soon and with so little weaponry. The Jensen inequality is given in
Subsec. 4.5. Moments and moment generating functions, which are
expectations of particular functions, are considered in the final subsection. One major
unproven result, that of the uniqueness of the moment generating function, is 
given there. Also included is a brief discussion of some measures of some 
characteristics, such as location and dispersion, of distribution or density 
functions. 

This chapter provides an introduction to the language of distribution 
theory. Only the univariate case is considered; the bivariate and multivariate 
cases will be considered in Chap. IV. It serves as a preface to, or even as a 
companion to, Chap. III, where a number of parametric families of distribution
functions is presented. Chapter III gives many examples of the concepts 
defined in Chap. II. 


2 RANDOM VARIABLE AND CUMULATIVE 
DISTRIBUTION FUNCTION 


2.1 Introduction 

In Chap. I we defined what we meant by a probability space, which we denoted
by the triplet $(\Omega, \mathcal{A}, P[\cdot])$. We started with a conceptual random experiment;
we called the totality of possible outcomes of this experiment the sample space
and denoted it by $\Omega$. $\mathcal{A}$ was used to denote a collection of subsets, called
events, of the sample space. Finally our probability function $P[\cdot]$ was a set
function having domain $\mathcal{A}$ and counterdomain the interval [0, 1]. Our object
was, and still is, to assess probabilities of events. In other words, we want to
model our random experiment so as to be able to give values to the probabilities
of events. The notion of random variable, to be defined presently, will be used
to describe events, and a cumulative distribution function will be used to give the
probabilities of certain events defined in terms of random variables; so both
concepts will assist us in defining probabilities of events, our goal. One
advantage that a cumulative distribution function will have over its counterpart, the
probability function (they both give probabilities of events), is that it is a
function with domain the real line and counterdomain the interval [0, 1]. Thus
we will be able to graph it. It will become a convenient tool in modeling
random experiments. In fact, we will often model a random experiment by
assuming certain things about a random variable and its distribution function
and in so doing completely bypass describing the probability space.


2.2 Definitions

We commence by defining a random variable. 

Definition 1 Random Variable For a given probability space $(\Omega, \mathcal{A}, P[\cdot])$,
a random variable, denoted by X or $X(\cdot)$, is a function with
domain $\Omega$ and counterdomain the real line. The function $X(\cdot)$ must be
such that the set $A_r$, defined by $A_r = \{\omega : X(\omega) \le r\}$, belongs to $\mathcal{A}$ for every
real number r. ////

If one thinks in terms of a random experiment, $\Omega$ is the totality of
outcomes of that random experiment, and the function, or random variable, $X(\cdot)$
with domain $\Omega$ makes some real number correspond to each outcome of the
experiment. That is the important part of our definition. The fact that we
also require the collection of $\omega$'s for which $X(\omega) \le r$ to be an event (i.e., an element
of $\mathcal{A}$) for each real number r is not much of a restriction for our purposes
since our intention is to use the notion of random variable only in describing
events. We will seldom be interested in a random variable per se; rather we
will be interested in events defined in terms of random variables. One might
note that the $P[\cdot]$ of our probability space $(\Omega, \mathcal{A}, P[\cdot])$ is not used in our
definition.

The use of words “random” and “variable” in the above definition is 
unfortunate since their use cannot be convincingly justified. The expression 
“random variable” is a misnomer that has gained such widespread use that it 
would be foolish for us to try to rename it. 

In our definition we denoted a random variable by either $X(\cdot)$ or X.
Although $X(\cdot)$ is a more complete notation, one that emphasizes that a random
variable is a function, we will usually use the shorter notation of X. For many 
experiments, there is a need to define more than one random variable; hence 
further notations are necessary. We will try to use capital Latin letters with or 
without affixes from near the end of the alphabet to denote random variables. 
Also, we use the corresponding small letter to denote a value of the random 
variable. 


EXAMPLE 1 Consider the experiment of tossing a single coin. Let the
random variable X denote the number of heads. $\Omega$ = {head, tail}, and
$X(\omega) = 1$ if $\omega$ = head, and $X(\omega) = 0$ if $\omega$ = tail; so, the random variable X
associates a real number with each outcome of the experiment. We
called X a random variable so mathematically speaking we should show
that it satisfies the definition; that is, we should show that $\{\omega : X(\omega) \le r\}$
belongs to $\mathcal{A}$ for every real number r. $\mathcal{A}$ consists of the four subsets:
$\phi$, {head}, {tail}, and $\Omega$. Now, if r < 0, $\{\omega : X(\omega) \le r\} = \phi$; and if
$0 \le r < 1$, $\{\omega : X(\omega) \le r\}$ = {tail}; and if $r \ge 1$, $\{\omega : X(\omega) \le r\} = \Omega$ = {head,
tail}. Hence, for each r the set $\{\omega : X(\omega) \le r\}$ belongs to $\mathcal{A}$; so $X(\cdot)$ is a
random variable. ////

FIGURE 1 [plot not reproduced; axis label: 1st dice]

EXAMPLE 2 Consider the experiment of tossing two dice. $\Omega$ can be
described by the 36 points displayed in Fig. 1.
$\Omega = \{(i, j) : i = 1, \ldots, 6 \text{ and } j = 1, \ldots, 6\}$. Several random variables can be defined; for instance, let
X denote the sum of the upturned faces; so $X(\omega) = i + j$ if $\omega = (i, j)$. Also,
let Y denote the absolute difference between the upturned faces; then
$Y(\omega) = |i - j|$ if $\omega = (i, j)$. It can be shown that both X and Y are
random variables. We see that X can take on the values 2, 3, ..., 12 and Y
can take on the values 0, 1, ..., 5. ////

In both of the above examples we described the random variables in terms 
of the random experiment rather than in specifying their functional form; such 
will usually be the case. 
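
For a finite sample space, a random variable really can be written down as an explicit function of the outcome. The following Python sketch does this for the two-dice experiment of Example 2; the names omega, X, Y, and event are choices made here for illustration and are not notation from the text.

```python
from itertools import product

# Sample space for two dice: the 36 ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))

def X(w):
    """Sum of the upturned faces."""
    i, j = w
    return i + j

def Y(w):
    """Absolute difference between the upturned faces."""
    i, j = w
    return abs(i - j)

# The values each random variable can take on.
range_X = sorted({X(w) for w in omega})
range_Y = sorted({Y(w) for w in omega})

# An event described through a random variable: {w : X(w) <= 3}.
event = [w for w in omega if X(w) <= 3]

print(range_X)  # the integers 2 through 12
print(range_Y)  # the integers 0 through 5
print(event)    # the outcomes whose sum is at most 3
```

The list named event is a subset of the sample space, which is the sense in which random variables are used to describe events.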

Definition 2 Cumulative distribution function The cumulative distribution
function of a random variable X, denoted by $F_X(\cdot)$, is defined to be
that function with domain the real line and counterdomain the interval
[0, 1] which satisfies $F_X(x) = P[X \le x] = P[\{\omega : X(\omega) \le x\}]$ for every real
number x. ////

A cumulative distribution function is uniquely defined for each random 
variable. If it is known, it can be used to find probabilities of events defined 
in terms of its corresponding random variable. (One might note that it is in 
this definition that we use the requirement that $\{\omega : X(\omega) \le r\}$ belong to $\mathcal{A}$ for
every real r which appears in our definition of random variable X.) Note that 
different random variables can have the same cumulative distribution function. 
See Example 4 below. 

The use of each of the three words in the expression “cumulative
distribution function” is justifiable. A cumulative distribution function is first of
all a function; it is a distribution function inasmuch as it tells us how the values
of the random variable are distributed, and it is a cumulative distribution
function since it gives the distribution of values in cumulative form. Many writers
omit the word “cumulative” in this definition. Examples and properties of 
cumulative distribution functions follow. 


EXAMPLE 3 Consider again the experiment of tossing a single coin. Assume
that the coin is fair. Let X denote the number of heads. Then,

$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1/2 & \text{if } 0 \le x < 1 \\ 1 & \text{if } 1 \le x. \end{cases}$$

Or $F_X(x) = \tfrac{1}{2} I_{[0, 1)}(x) + I_{[1, \infty)}(x)$ in our indicator function notation. ////


EXAMPLE 4 In the experiment of tossing two fair dice, let Y denote the
absolute difference. The cumulative distribution of Y, $F_Y(\cdot)$, is sketched
in Fig. 2. Also, let $X_k$ denote the value on the upturned face of the kth
die for k = 1, 2. $X_1$ and $X_2$ are different random variables, yet both
have the same cumulative distribution function, which is

$$F_X(x) = \sum_{i=1}^{5} \frac{i}{6} I_{[i, i+1)}(x) + I_{[6, \infty)}(x)$$

and is sketched in Fig. 3. ////
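
The claims in Example 4 can be checked by counting outcomes. The sketch below assumes the 36 outcomes are equally likely (fair dice) and uses exact fractions; the helper name cdf and the variable names are illustrative, not from the text.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def cdf(rv, x):
    """F(x) = P[rv <= x], computed by counting outcomes."""
    favorable = sum(1 for w in omega if rv(w) <= x)
    return Fraction(favorable, len(omega))

def Y(w):
    return abs(w[0] - w[1])   # absolute difference

def X1(w):
    return w[0]               # face of the first die

def X2(w):
    return w[1]               # face of the second die

# Tabulate F_Y at its jump points 0, 1, ..., 5.
F_Y = [cdf(Y, y) for y in range(6)]
print(F_Y)

# X1 and X2 are different functions on omega ...
differ = any(X1(w) != X2(w) for w in omega)
# ... yet their cumulative distribution functions agree at every point tried.
same_cdf = all(cdf(X1, x) == cdf(X2, x) for x in range(0, 8))
print(differ, same_cdf)
```

Because both functions are step functions that jump only at the integers 1 through 6, agreement at the integers 0 through 7 is enough to conclude that they agree everywhere.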

Careful scrutiny of the definition and above examples might indicate the
following properties of any cumulative distribution function $F_X(\cdot)$.


Properties of a Cumulative Distribution Function $F_X(\cdot)$

(i) $F_X(-\infty) = \lim_{x \to -\infty} F_X(x) = 0$, and $F_X(+\infty) = \lim_{x \to +\infty} F_X(x) = 1$.

(ii) $F_X(\cdot)$ is a monotone, nondecreasing function; that is, $F_X(a) \le F_X(b)$
for a < b.

(iii) $F_X(\cdot)$ is continuous from the right; that is,

$$\lim_{0 < h \to 0} F_X(x + h) = F_X(x).$$

Except for (ii), we will not prove these properties. Note that the event
$\{\omega : X(\omega) \le b\} = \{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$ and $\{X \le a\} \cap \{a < X \le b\} = \phi$;
hence, $F_X(b) = P[X \le b] = P[X \le a] + P[a < X \le b] \ge P[X \le a] = F_X(a)$,
which proves (ii). Property (iii), the continuity of $F_X(\cdot)$ from the right, results
from our defining $F_X(x)$ to be $P[X \le x]$. If we had defined, as some authors do,
$F_X(x)$ to be $P[X < x]$, then $F_X(\cdot)$ would have been continuous from the left.
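
As a numerical illustration (not a proof), the three properties can be spot-checked on the coin-tossing distribution function of Example 3. The grid, the step size h, and the variable names below are arbitrary choices made for this sketch.

```python
def F(x):
    """CDF of the number of heads in one toss of a fair coin."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

grid = [k / 10 for k in range(-30, 31)]   # -3.0, -2.9, ..., 3.0
values = [F(x) for x in grid]

# (i) The limiting values 0 and 1 are actually attained by this F.
limits_ok = F(-1e9) == 0.0 and F(1e9) == 1.0

# (ii) F is nondecreasing along the grid.
monotone = all(a <= b for a, b in zip(values, values[1:]))

# (iii) At the jump x = 0, approaching from the right gives F(0),
#       while approaching from the left does not.
h = 1e-9
right_ok = F(0 + h) == F(0)
left_differs = F(0 - h) != F(0)

print(limits_ok, monotone, right_ok, left_differs)
```

The last two checks show the asymmetry at a jump point: the value at the jump belongs to the right-hand piece, which is what defining the function through $P[X \le x]$ produces.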

Definition 3 Cumulative distribution function Any function $F(\cdot)$ with
domain the real line and counterdomain the interval [0, 1] satisfying the
above three properties is defined to be a cumulative distribution function.
////

This definition allows us to use the term “cumulative distribution
function” without mentioning random variable.

After defining what is meant by continuous and discrete random variables 
in the first two subsections of the next section, we will give another property 
that cumulative distribution functions possess, the property of decomposition 
into three parts.

The cumulative distribution functions defined here are univariate; the
introduction of bivariate and multivariate cumulative distribution functions 
will be deferred until Chap. IV. 


3 DENSITY FUNCTIONS 

Random variable and the cumulative distribution function of a random variable 
have been defined. The cumulative distribution function described the
distribution of values of the random variable. For two distinct classes of random
variables, the distribution of values can be described more simply by using 
density functions. These two classes, distinguished by the words “discrete” 
and “continuous,” are considered in the next two subsections. 

3.1 Discrete Random Variables 

Definition 4 Discrete random variable A random variable X will be
defined to be discrete if the range of X is countable. If a random variable
X is discrete, then its corresponding cumulative distribution function
$F_X(\cdot)$ will be defined to be discrete. ////

By the range of X being countable we mean that there exists a finite or
denumerable set of real numbers, say $x_1, x_2, x_3, \ldots$, such that X takes on values
only in that set. If X is discrete with distinct values $x_1, x_2, \ldots, x_n, \ldots$, then
$\Omega = \bigcup_n \{\omega : X(\omega) = x_n\} = \bigcup_n \{X = x_n\}$, and $\{X = x_i\} \cap \{X = x_j\} = \phi$ for $i \ne j$;
hence $1 = P[\Omega] = \sum_n P[X = x_n]$ by the third axiom of probability.

Definition 5 Discrete density function of a discrete random variable If
X is a discrete random variable with distinct values $x_1, x_2, \ldots, x_n, \ldots$,
then the function, denoted by $f_X(\cdot)$ and defined by

$$f_X(x) = \begin{cases} P[X = x_j] & \text{if } x = x_j,\ j = 1, 2, \ldots, n, \ldots \\ 0 & \text{if } x \ne x_j \end{cases} \qquad (1)$$

is defined to be the discrete density function of X. ////

The values of a discrete random variable are often called mass points, and
$f_X(x_j)$ denotes the mass associated with the mass point $x_j$. Probability mass
function, discrete frequency function, and probability function are other terms
used in place of discrete density function. Also, the notation $p_X(\cdot)$ is
sometimes used instead of $f_X(\cdot)$ for discrete density functions. $f_X(\cdot)$ is a function
with domain the real line and counterdomain the interval $[0, 1]$. If we use the
indicator function,

$$f_X(x) = \sum_{n=1}^{\infty} P[X = x_n]\, I_{\{x_n\}}(x), \qquad (2)$$

where $I_{\{x_n\}}(x) = 1$ if $x = x_n$ and $I_{\{x_n\}}(x) = 0$ if $x \neq x_n$.

Theorem 1 Let $X$ be a discrete random variable. $F_X(\cdot)$ can be obtained
from $f_X(\cdot)$, and vice versa.

proof Denote the mass points of $X$ by $x_1, x_2, \ldots$. Suppose $f_X(\cdot)$
is given; then $F_X(x) = \sum_{\{j : x_j \le x\}} f_X(x_j)$. Conversely, suppose $F_X(\cdot)$ is given;
then $f_X(x_j) = F_X(x_j) - \lim_{0 < h \to 0} F_X(x_j - h)$; hence $f_X(x_j)$ can be found for
each mass point $x_j$; however, $f_X(x) = 0$ for $x \neq x_j$, $j = 1, 2, \ldots$, so $f_X(x)$ is
determined for all real numbers. ////

EXAMPLE 5 To illustrate what is meant in Theorem 1, consider the experiment
of tossing a single die. Let $X$ denote the number of spots on the
upper face:

$$f_X(x) = \tfrac{1}{6}\, I_{\{1, 2, \ldots, 6\}}(x)$$

and

$$F_X(x) = \sum_{i=1}^{5} \tfrac{i}{6}\, I_{[i,\, i+1)}(x) + I_{[6,\, \infty)}(x).$$

According to Theorem 1, for given $f_X(\cdot)$, $F_X(x)$ can be found for any $x$;
for instance, if $x = 2.5$,

$$F_X(2.5) = \sum_{\{j : x_j \le 2.5\}} f_X(x_j) = f_X(1) + f_X(2) = \tfrac{2}{6}.$$

And, if $F_X(\cdot)$ is given, $f_X(x)$ can be found for any $x$. For example, for
$x = 3$,

$$f_X(3) = F_X(3) - \lim_{0 < h \to 0} F_X(3 - h) = \tfrac{3}{6} - \tfrac{2}{6} = \tfrac{1}{6}.$$

////
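
The two directions of Theorem 1 can be sketched in a few lines of Python for the single-die case. This is only an illustration under stated assumptions: the names `pmf`, `cdf_from_pmf`, and `pmf_from_cdf` are hypothetical, and the left-hand limit is replaced by a fixed step $h = 1/2$, which is exact here only because the cumulative distribution function is flat between mass points that are one unit apart.

```python
from fractions import Fraction

# Discrete density of a fair die: mass 1/6 at each of 1, 2, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf_from_pmf(x):
    """F_X(x): sum of f_X(x_j) over the mass points x_j <= x."""
    return sum((p for xj, p in pmf.items() if xj <= x), Fraction(0))

def pmf_from_cdf(xj):
    """f_X(x_j) = F_X(x_j) - F_X(x_j - h) for a small h > 0.

    F_X is flat between mass points, so any h smaller than the gap
    between adjacent mass points reproduces the left-hand limit.
    """
    h = Fraction(1, 2)  # assumption: adjacent mass points are 1 apart
    return cdf_from_pmf(xj) - cdf_from_pmf(xj - h)

print(cdf_from_pmf(2.5))  # 1/3, that is, 2/6
print(pmf_from_cdf(3))    # 1/6
```

Exact rational arithmetic is used so that the values agree with the fractions in the example rather than with rounded decimals.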


The cumulative distribution function of a discrete random variable has
steps at the mass points; that is, at the mass point $x_j$, $F_X(\cdot)$ has a step of size
$f_X(x_j)$, and $F_X(\cdot)$ is flat between mass points.


EXAMPLE 6 Consider the experiment of tossing two dice. Let $X$ denote
the total of the upturned faces. The mass points of $X$ are 2, 3, ..., 12.
$f_X(\cdot)$ is sketched in Fig. 4. Let $Y$ denote the absolute difference of
the upturned faces; then $f_Y(\cdot)$ is given in tabular form by

    y         0       1       2       3       4       5
    f_Y(y)    6/36    10/36   8/36    6/36    4/36    2/36
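
The table can be reproduced by enumerating the 36 equally likely outcomes. The sketch below assumes fair dice; the helper name `pmf_of` is hypothetical and not part of the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of tossing two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def pmf_of(statistic):
    """Discrete density of statistic(a, b), as a dict value -> probability."""
    counts = Counter(statistic(a, b) for a, b in outcomes)
    return {v: Fraction(c, len(outcomes)) for v, c in sorted(counts.items())}

f_X = pmf_of(lambda a, b: a + b)       # total of the upturned faces
f_Y = pmf_of(lambda a, b: abs(a - b))  # absolute difference

for y, p in f_Y.items():
    # Print each probability over the common denominator 36.
    print(y, f"{p.numerator * (36 // p.denominator)}/36")
```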


The discrete density function tells us how likely or probable each of the
values of a discrete random variable is. It also enables one to calculate the
probability of events described in terms of the discrete random variable $X$.
For example, let $X$ have mass points $x_1, x_2, \ldots, x_n, \ldots$; then
$P[a < X \le b] = \sum_{\{j : a < x_j \le b\}} f_X(x_j)$ for $a < b$.

Definition 6 Discrete density function Any function $f(\cdot)$ with domain
the real line and counterdomain $[0, 1]$ is defined to be a discrete density
function if for some countable set $x_1, x_2, \ldots, x_n, \ldots$,

(i) $f(x_j) > 0$ for $j = 1, 2, \ldots$.

(ii) $f(x) = 0$ for $x \neq x_j$; $j = 1, 2, \ldots$.

(iii) $\sum f(x_j) = 1$, where the summation is over the points $x_1, x_2, \ldots, x_n, \ldots$. ////

This definition allows us to speak of discrete density functions without 
reference to some random variable. Hence we can talk about properties that 
a density function might have without referring to a random variable. 

3.2 Continuous Random Variables 

Definition 7 Continuous random variable A random variable $X$ is
called continuous if there exists a function $f_X(\cdot)$ such that
$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$
for every real number $x$. The cumulative distribution function $F_X(\cdot)$ of a
continuous random variable $X$ is called absolutely continuous. ////

Definition 8 Probability density function of a continuous random variable
If $X$ is a continuous random variable, the function $f_X(\cdot)$ in
$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$ is called the probability density function of $X$. ////

Other names that are used instead of probability density function include 
density function, continuous density function, and integrating density function. 

Note that strictly speaking the probability density function $f_X(\cdot)$ of a
random variable $X$ is not uniquely defined. All that the definition requires is
that the integral of $f_X(\cdot)$ gives $F_X(x)$ for every $x$, and more than one function
$f_X(\cdot)$ may satisfy such a requirement. For example, suppose
$F_X(x) = x I_{[0,1)}(x) + I_{[1,\infty)}(x)$; then $f_X(u) = I_{(0,1)}(u)$ satisfies
$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$ for every $x$, and
so $f_X(\cdot)$ is a probability density function of $X$. However,
$f_X(u) = I_{(0,1/2)}(u) + 69 I_{\{1/2\}}(u) + I_{(1/2,1)}(u)$ also satisfies
$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$. (The idea is that if the
value of a function is changed at only a “few” points, then its integral is
unchanged.) In practice a unique choice of $f_X(\cdot)$ is often dictated by continuity
considerations and for this reason we will usually allow ourselves the liberty of
speaking of the probability density when in fact a probability density is more
correct.

One should point out that the word “continuous” in “continuous
random variable” is not used in its usual sense. Although a random variable
is a function and the notion of a continuous function is fairly well established
in mathematics, “continuous” here is not used in that usual mathematical
sense. In fact it is not clear in what sense it is used. Two possible justifications
do come to mind. In contrasting discrete random variables with continuous
random variables, one notes that a discrete random variable takes on a
finite or denumerable set of values whereas a continuous random variable takes
on a nondenumerable set of values. Possibly it is the connection between
“nondenumerable” and “continuum” that justifies use of the word “continuous.”
All the continuous random variables that we shall encounter will take
on a continuum of values. The second justification arises when one notes that
the absolute continuity of the cumulative distribution function is the regular
mathematical definition of an absolutely continuous function (in words, a
function is called absolutely continuous if it can be written as the integral of its
derivative); the “continuous,” then, in a corresponding continuous random
variable could be considered just an abbreviation of “absolutely continuous.”

Theorem 2 Let $X$ be a continuous random variable. Then $F_X(\cdot)$ can
be obtained from an $f_X(\cdot)$, and vice versa.

proof If $X$ is a continuous random variable and an $f_X(\cdot)$ is given,
then $F_X(x)$ is obtained by integrating $f_X(\cdot)$; that is,
$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$. On
the other hand, if $F_X(\cdot)$ is given, then an $f_X(x)$ can be obtained by
differentiation; that is, $f_X(x) = dF_X(x)/dx$ for those points $x$ for which $F_X(x)$ is
differentiable. ////

The notations for discrete density function and probability density function
are the same, yet they have quite different interpretations. For discrete
random variables $f_X(x) = P[X = x]$, which is not true for continuous random
variables. For continuous random variables,

$$f_X(x) = \frac{dF_X(x)}{dx} = \lim_{\Delta x \to 0} \frac{F_X(x + \Delta x) - F_X(x - \Delta x)}{2\Delta x};$$

hence $f_X(x)\, 2\Delta x \approx F_X(x + \Delta x) - F_X(x - \Delta x) = P[x - \Delta x < X \le x + \Delta x]$; that
is, the probability that $X$ is in a small interval containing the value $x$ is
approximately equal to $f_X(x)$ times the width of the interval. For discrete random
variables $f_X(\cdot)$ is a function with domain the real line and counterdomain the
interval $[0, 1]$; whereas, for continuous random variables $f_X(\cdot)$ is a function with
domain the real line and counterdomain the infinite interval $[0, \infty)$.

Remark We will use the term “ density function ” without the modifier 
of “ discrete ” or “ probability ” to represent either kind of density. //// 


EXAMPLE 7 Let $X$ be the random variable representing the length of a
telephone conversation. One could model this experiment by assuming
that the distribution of $X$ is given by $F_X(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$, where
$\lambda$ is some positive number. The corresponding probability density function
would be given by $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$. If we assume that
telephone conversations are measured in minutes,
$P[5 < X \le 10] = \int_5^{10} \lambda e^{-\lambda x}\, dx = e^{-5\lambda} - e^{-10\lambda} = e^{-1} - e^{-2} \approx .23$ for $\lambda = \tfrac{1}{5}$, or
$P[5 < X \le 10] = P[X \le 10] - P[X \le 5] = (1 - e^{-10\lambda}) - (1 - e^{-5\lambda}) = e^{-1} - e^{-2}$ for
$\lambda = \tfrac{1}{5}$. ////
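
Both routes to $P[5 < X \le 10]$ can be checked numerically. The following is a minimal sketch with $\lambda = 1/5$ as in the example; the midpoint-rule helper `integrate` and the other names are illustrative assumptions, not part of the text.

```python
import math

lam = 1 / 5  # the example's value of lambda

def pdf(x):
    """f_X(x) = lambda * exp(-lambda * x) for x >= 0, and 0 otherwise."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    """F_X(x) = 1 - exp(-lambda * x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

by_integral = integrate(pdf, 5, 10)      # integrate the density
by_cdf = cdf(10) - cdf(5)                # difference of cdf values
exact = math.exp(-1) - math.exp(-2)      # closed form from the example

print(round(by_integral, 4), round(by_cdf, 4), round(exact, 4))
```

All three numbers should agree to the displayed precision and round to the value .23 quoted in the example.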

The probability density function is used to calculate the probability of
events defined in terms of the corresponding continuous random variable $X$.
For example, $P[a < X \le b] = \int_a^b f_X(x)\, dx$ for $a < b$.

Definition 9 Probability density function Any function $f(\cdot)$ with domain
the real line and counterdomain $[0, \infty)$ is defined to be a probability
density function if and only if

(i) $f(x) \ge 0$ for all $x$.

(ii) $\int_{-\infty}^{\infty} f(x)\, dx = 1$. ////


With this definition we can speak of probability density functions without
reference to random variables. We might note that a probability density function
of a continuous random variable as defined in Definition 8 does indeed
possess the two properties in the above definition.

3.3 Other Random Variables 

Not all random variables are either continuous or discrete; that is, not all cumulative
distribution functions are either absolutely continuous or discrete.

FIGURE 5  Graph of $F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$.



EXAMPLE 8 Consider the experiment of recording the delay that a motorist
encounters at a one-way traffic stop sign. Let $X$ be the random variable
that represents the delay that the motorist experiences after making the
required stop. There is a certain probability that there will be no opposing
traffic so that the motorist will be able to proceed with no delay. On
the other hand, if the motorist has to wait, he may have to wait for any of
a continuum of possible times. This experiment could be modeled by
assuming that $X$ has a cumulative distribution function given by
$F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$. This $F_X(x)$ has a jump of $1 - p$ at $x = 0$ but is
continuous for $x > 0$. See Fig. 5. ////

Many practical examples of cumulative distribution functions that are 
partly discrete and partly absolutely continuous can be given. Yet there are 
still other types of cumulative distribution functions. There are continuous 
cumulative distribution functions, called singular continuous, whose derivative 
is 0 at almost all points. We will not consider such distribution functions other 
than to note the following result. 


Decomposition of a cumulative distribution function Any cumulative
distribution function $F(x)$ may be represented in the form

$$F(x) = p_1 F^{d}(x) + p_2 F^{ac}(x) + p_3 F^{sc}(x), \qquad (3)$$

where $p_i \ge 0$, $i = 1, 2, 3$, $\sum_{i=1}^{3} p_i = 1$, and $F^{d}(\cdot)$, $F^{ac}(\cdot)$, and $F^{sc}(\cdot)$ are each
cumulative distribution functions
with $F^{d}(\cdot)$ discrete, $F^{ac}(\cdot)$ absolutely continuous, and $F^{sc}(\cdot)$ singular continuous.

Cumulative distributions studied in this book will have at most a discrete
part and an absolutely continuous part; that is, the $p_3$ in Eq. (3) will always be 0
for the $F(\cdot)$ that we will study.

EXAMPLE 9 To illustrate how the decomposition of a cumulative distribution
function can be implemented, consider $F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$
as in Example 8. $F_X(x) = (1 - p) F^{d}(x) + p F^{ac}(x)$, where $F^{d}(x) = I_{[0,\infty)}(x)$
and $F^{ac}(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$. Note that
$F_X(x) = (1 - p) F^{d}(x) + p F^{ac}(x) = (1 - p) I_{[0,\infty)}(x) + p(1 - e^{-\lambda x}) I_{[0,\infty)}(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$.

////
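
The identity in this example can be spot-checked on a grid of points. The sketch below is only illustrative: the values $p = 0.3$ and $\lambda = 2$ are arbitrary assumptions (the text leaves both unspecified), and the function names are hypothetical.

```python
import math

p, lam = 0.3, 2.0  # arbitrary illustrative values, not from the text

def F(x):
    """Mixed cdf (1 - p * exp(-lam * x)) on [0, infinity), 0 elsewhere."""
    return 1.0 - p * math.exp(-lam * x) if x >= 0 else 0.0

def F_d(x):
    """Discrete part: all mass at the single point 0."""
    return 1.0 if x >= 0 else 0.0

def F_ac(x):
    """Absolutely continuous part: the exponential cdf."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

grid = [-1.0, 0.0, 0.1, 0.5, 1.0, 3.0, 10.0]
# Largest discrepancy between F and (1 - p) F_d + p F_ac on the grid.
max_gap = max(abs(F(x) - ((1 - p) * F_d(x) + p * F_ac(x))) for x in grid)
# Size of the jump of F at 0, approximated from just left of 0.
jump_at_zero = F(0.0) - F(-1e-12)

print(max_gap, jump_at_zero)
```

The jump at 0 should equal $1 - p$, the weight of the discrete part.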

A density function corresponding to a cumulative distribution that is
partly discrete and partly absolutely continuous could be defined as follows:
If $F(x) = (1 - p) F^{d}(x) + p F^{ac}(x)$, where $0 < p < 1$ and $F^{d}(\cdot)$ and $F^{ac}(\cdot)$ are,
respectively, discrete and absolutely continuous cumulative distribution functions,
let the density function $f(x)$ corresponding to $F(x)$ be defined by
$f(x) = (1 - p) f^{d}(x) + p f^{ac}(x)$, where $f^{d}(\cdot)$ is the discrete density function
corresponding to $F^{d}(\cdot)$ and $f^{ac}(\cdot)$ is the probability density function corresponding to
$F^{ac}(\cdot)$. Such a density function would require careful interpretation; so when
considering cumulative distribution functions that are partly discrete and
partly continuous, we will tend to work with the cumulative distribution function
itself rather than with a density function.

Remark In future chapters we will frequently have to state that a
random variable has a certain distribution. We will make such a statement
by giving either the cumulative distribution function or the density
function of the random variable of interest. ////


4 EXPECTATIONS AND MOMENTS 

An extremely useful concept in problems involving random variables or
distributions is that of expectation. The subsections of this section give definitions
and results regarding expectations.


4.1 Mean 

Definition 10 Mean Let $X$ be a random variable. The mean of $X$,
denoted by $\mu_X$ or $\mathscr{E}[X]$, is defined by:

(i) $\mathscr{E}[X] = \sum_j x_j f_X(x_j)$   (4)
if $X$ is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$.

(ii) $\mathscr{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx$   (5)
if $X$ is continuous with probability density function $f_X(x)$.

(iii) $\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\, dx - \int_{-\infty}^{0} F_X(x)\, dx$   (6)
for an arbitrary random variable $X$. ////


In (i), $\mathscr{E}[X]$ is defined to be the indicated series provided that the series is
absolutely convergent; otherwise, we say that the mean does not exist. And in
(ii), $\mathscr{E}[X]$ is defined to be the indicated integral if the integral exists; otherwise,
we say that the mean does not exist. Finally, in (iii), we require that both
integrals be finite for the existence of $\mathscr{E}[X]$.

Note what the definition says: In $\sum_j x_j f_X(x_j)$, the summand is the $j$th value
of the random variable $X$ multiplied by the probability that $X$ equals that $j$th
value, and then the summation is over all values. So $\mathscr{E}[X]$ is an “average” of the
values that the random variable takes on, where each value is weighted by the
probability that the random variable is equal to that value. Values that are
more probable receive more weight. The same is true in integral form in (ii).
There the value $x$ is multiplied by the approximate probability that $X$ equals
the value $x$, namely $f_X(x)\, dx$, and then integrated over all values.

Several remarks are in order. 


Remark In the definition of a mean of a random variable, only density 
functions [in (i) and (ii)] or distribution functions [in (iii)] were used; 
hence we have really defined the mean for these functions without reference 
to random variables. We then call the defined mean the mean of the 
cumulative distribution function or of the appropriate density function. 
Hence, we can and will speak of the mean of a distribution or density 
function as well as the mean of a random variable. //// 


Remark $\mathscr{E}[X]$ is the center of gravity (or centroid) of the unit mass that
is determined by the density function of $X$. So the mean of $X$ is a measure
of where the values of the random variable $X$ are “centered.” Other
measures of “location” or “center” of a random variable or its corresponding
density are given in Subsec. 4.6. ////

Remark (iii) of the definition is for all random variables; whereas
(i) is for discrete random variables, and (ii) is for continuous random
variables. Of course, $\mathscr{E}[X]$ could have been defined by just giving (iii).
The reason for including (i) and (ii) is that they are more intuitive for
their respective cases. It can be proved, although we will not do it, that
(i) follows from (iii) in the case of discrete random variables and (ii) follows
from (iii) in the case of continuous random variables. Our main use
of (iii) will be in finding the mean of a random variable $X$ that is neither
discrete nor continuous. See Example 12 below. ////

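EXAMPLE 10 Consider the experiment of tossing two dice. Let $X$ denote
the total of the two dice and $Y$ their absolute difference. The discrete
density functions for $X$ and $Y$ are given in Example 6.

$$\mathscr{E}[Y] = \sum_j y_j f_Y(y_j) = \sum_{i=0}^{5} i f_Y(i) = 0 \cdot \tfrac{6}{36} + 1 \cdot \tfrac{10}{36} + 2 \cdot \tfrac{8}{36} + 3 \cdot \tfrac{6}{36} + 4 \cdot \tfrac{4}{36} + 5 \cdot \tfrac{2}{36} = \tfrac{70}{36}.$$

$$\mathscr{E}[X] = \sum_{i=2}^{12} i f_X(i) = 7.$$

Note that $\mathscr{E}[Y]$ is not one of the possible values of $Y$. ////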
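
The two means in this example can be recomputed from the definition for a discrete random variable. The sketch assumes fair dice and builds the discrete densities by enumeration; `pmf_of` and `mean` are hypothetical helper names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def pmf_of(statistic):
    """Discrete density of statistic(a, b) over the two-dice experiment."""
    counts = Counter(statistic(a, b) for a, b in rolls)
    return {v: Fraction(c, len(rolls)) for v, c in counts.items()}

def mean(pmf):
    """Sum of x_j * f(x_j) over the mass points x_j."""
    return sum(x * p for x, p in pmf.items())

mean_X = mean(pmf_of(lambda a, b: a + b))       # total of the two dice
mean_Y = mean(pmf_of(lambda a, b: abs(a - b)))  # absolute difference

print(mean_X, mean_Y)  # 7 35/18
```

Here 35/18 is 70/36 in lowest terms, which is not a mass point of $Y$.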

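EXAMPLE 11 Let $X$ be a continuous random variable with probability
density function $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$.

$$\mathscr{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}.$$

The corresponding cumulative distribution function is
$F_X(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$; so

$$\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\, dx - \int_{-\infty}^{0} F_X(x)\, dx = \int_0^{\infty} (1 - 1 + e^{-\lambda x})\, dx = \frac{1}{\lambda}.$$

////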


EXAMPLE 12 Let $X$ be a random variable with cumulative distribution
function given by $F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$; then

$$\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\, dx - \int_{-\infty}^{0} F_X(x)\, dx = \int_0^{\infty} p e^{-\lambda x}\, dx = \frac{p}{\lambda}.$$

Here, we have used Eq. (6) to find the mean of a random variable that is
partly discrete and partly continuous. ////
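
The use of Eq. (6) can be mimicked numerically. In this rough sketch the values $p = 0.3$ and $\lambda = 2$ are arbitrary assumptions, the infinite ranges are truncated at a hypothetical cutoff `upper = 50`, and `integrate` is a simple midpoint rule written for the illustration.

```python
import math

p, lam = 0.3, 2.0  # arbitrary illustrative values, not from the text

def F(x):
    """Cdf that is partly discrete (jump 1 - p at 0) and partly continuous."""
    return 1.0 - p * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

upper = 50.0  # truncation point; the neglected tail is of order exp(-100)

# Eq. (6): integral of 1 - F over (0, inf) minus integral of F over (-inf, 0).
mean_via_eq6 = (integrate(lambda x: 1.0 - F(x), 0.0, upper)
                - integrate(F, -upper, 0.0))

print(mean_via_eq6, p / lam)
```

The two printed numbers should agree up to the discretization error of the midpoint rule.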





EXAMPLE 13 Let $X$ be a random variable with probability density function
given by $f_X(x) = x^{-2} I_{[1,\infty)}(x)$; then

$$\mathscr{E}[X] = \int_1^{\infty} x \frac{dx}{x^2} = \lim_{b \to \infty} \log_e b = \infty,$$

so we say that $\mathscr{E}[X]$ does not exist. We might also say that the mean of $X$
is infinite since it is clear here that the integral that defines the mean is
infinite. ////


4.2 Variance 

The mean of a random variable $X$, defined in the previous subsection, was a
measure of central location of the density of $X$. The variance of a random
variable $X$ will be a measure of the spread or dispersion of the density of $X$.


Definition 11 Variance Let $X$ be a random variable, and let $\mu_X$ be
$\mathscr{E}[X]$. The variance of $X$, denoted by $\sigma_X^2$ or $\operatorname{var}[X]$, is defined by

(i) $\operatorname{var}[X] = \sum_j (x_j - \mu_X)^2 f_X(x_j)$   (7)
if $X$ is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$.

(ii) $\operatorname{var}[X] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx$   (8)
if $X$ is continuous with probability density function $f_X(x)$.

(iii) $\operatorname{var}[X] = \int_0^{\infty} 2x[1 - F_X(x) + F_X(-x)]\, dx - \mu_X^2$   (9)
for an arbitrary random variable $X$. ////


The variances are defined only if the series in (i) is convergent or if the 
integrals in (ii) and (iii) exist. Again, the variance of a random variable is 
defined in terms of the density function or cumulative distribution function of 
the random variable; hence variance could be defined in terms of these functions 
without reference to a random variable. 
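
Because part (iii) of the definition needs only the cumulative distribution function, it can be evaluated numerically for a distribution that is neither discrete nor continuous. The sketch below assumes a cumulative distribution function of the form $(1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$ with arbitrary values $p = 0.3$ and $\lambda = 2$, takes the mean $p/\lambda$ as known, and truncates the integral at a hypothetical cutoff; none of these choices comes from the text.

```python
import math

p, lam = 0.3, 2.0  # arbitrary illustrative values

def F(x):
    """Cdf with a jump of 1 - p at 0 and an exponential part for x > 0."""
    return 1.0 - p * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

upper = 50.0       # truncation point for the infinite upper limit
mu = p / lam       # mean of this distribution, assumed known here

# Part (iii): integral of 2x[1 - F(x) + F(-x)] over (0, inf), minus mu^2.
second_moment = integrate(lambda x: 2 * x * (1 - F(x) + F(-x)), 0.0, upper)
variance = second_moment - mu ** 2

print(variance, p * (2 - p) / lam ** 2)
```

For this family the closed form is $p(2 - p)/\lambda^2$, so the two printed numbers should agree up to discretization error.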

Note what the definition says: In (i), the square of the difference between
the $j$th value of the random variable $X$ and the mean of $X$ is multiplied by the
probability that $X$ equals the $j$th value, and then these terms are summed.
More weight is assigned to the more probable squared differences. A similar
comment applies for (ii). Variance is a measure of spread since if the values
of a random variable $X$ tend to be far from their mean, the variance of $X$ will
be larger than the variance of a comparable random variable $Y$ whose values
tend to be near their mean. It is clear from (i) and (ii) and true for (iii) that
variance is nonnegative. We saw that a mean was the center of gravity of a
density; similarly (for those readers familiar with elementary physics or
mechanics), variance represents the moment of inertia of the same density with
respect to a perpendicular axis through the center of gravity.

Definition 12 Standard deviation If $X$ is a random variable, the
standard deviation of $X$, denoted by $\sigma_X$, is defined as $+\sqrt{\operatorname{var}[X]}$. ////

The standard deviation of a random variable, like the variance, is a measure
of the spread or dispersion of the values of the random variable. In many
applications it is preferable to the variance as such a measure since it will have
the same measurement units as the random variable itself.


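EXAMPLE 14 Let $X$ be the total of the two dice in the experiment of tossing
two dice.

$$\operatorname{var}[X] = \sum_j (x_j - \mu_X)^2 f_X(x_j)$$
$$= (2 - 7)^2 \tfrac{1}{36} + (3 - 7)^2 \tfrac{2}{36} + (4 - 7)^2 \tfrac{3}{36} + (5 - 7)^2 \tfrac{4}{36} + (6 - 7)^2 \tfrac{5}{36} + (7 - 7)^2 \tfrac{6}{36}$$
$$+ (8 - 7)^2 \tfrac{5}{36} + (9 - 7)^2 \tfrac{4}{36} + (10 - 7)^2 \tfrac{3}{36} + (11 - 7)^2 \tfrac{2}{36} + (12 - 7)^2 \tfrac{1}{36} = \tfrac{210}{36}.$$

////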
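
The sum in this example is easy to verify by enumeration. The sketch assumes fair dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Discrete density of the total of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
f_X = {x: Fraction(c, 36) for x, c in counts.items()}

mu = sum(x * p for x, p in f_X.items())               # mean
var = sum((x - mu) ** 2 * p for x, p in f_X.items())  # variance, form (i)
sd = float(var) ** 0.5                                # standard deviation

print(mu, var)  # 7 35/6
```

The value 35/6 is 210/36 in lowest terms.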

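EXAMPLE 15 Let $X$ be a random variable with probability density given by
$f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$; then

$$\operatorname{var}[X] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx = \int_0^{\infty} \left(x - \frac{1}{\lambda}\right)^2 \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda^2}.$$

////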


EXAMPLE 16 Let $X$ be a random variable with cumulative distribution
given by $F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x)$; then

$$\operatorname{var}[X] = \int_0^{\infty} 2x[1 - F_X(x) + F_X(-x)]\, dx - \mu_X^2 = \int_0^{\infty} 2x p e^{-\lambda x}\, dx - \left(\frac{p}{\lambda}\right)^2 = \frac{2p}{\lambda^2} - \frac{p^2}{\lambda^2} = \frac{p(2 - p)}{\lambda^2}.$$

////


4.3 Expected Value of a Function of a Random Variable

We defined the expectation of an arbitrary random variable X, called the mean 
of X, in Subsec. 4.1. In this subsection, we will define the expectation of a 
function of a random variable for discrete or continuous random variables. 

Definition 13 Expectation Let $X$ be a random variable and $g(\cdot)$ be
a function with both domain and counterdomain the real line. The
expectation or expected value of the function $g(\cdot)$ of the random variable
$X$, denoted by $\mathscr{E}[g(X)]$, is defined by:

(i) $\mathscr{E}[g(X)] = \sum_j g(x_j) f_X(x_j)$   (10)
if $X$ is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$ (provided this
series is absolutely convergent).

(ii) $\mathscr{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx$   (11)
if $X$ is continuous with probability density function $f_X(x)$ (provided
$\int_{-\infty}^{\infty} |g(x)| f_X(x)\, dx < \infty$).* ////

Expectation or expected value is not really a very good name since it is 
not necessarily what you “expect.” For example, the expected value of a 
discrete random variable is not necessarily one of the possible values of the 
discrete random variable, in which case, you would not “expect” to get the 
expected value. A better name might be “ average value ” rather than “ expected 
value.” 

Since $\mathscr{E}[g(X)]$ is defined in terms of the density function of $X$, it could be
defined without reference to a random variable.

Remark If $g(x) = x$, then $\mathscr{E}[g(X)] = \mathscr{E}[X]$ is the mean of $X$. If $g(x) = (x - \mu_X)^2$,
then $\mathscr{E}[g(X)] = \mathscr{E}[(X - \mu_X)^2] = \operatorname{var}[X]$. ////

* $\mathscr{E}[g(X)]$ has been defined here for random variables that are either discrete or
continuous; it can be defined for other random variables as well. For the reader
who is familiar with the Stieltjes integral, $\mathscr{E}[g(X)]$ is defined as the Stieltjes integral
$\int_{-\infty}^{\infty} g(x)\, dF_X(x)$ (provided this integral exists), where $F_X(\cdot)$ is the cumulative
distribution function of $X$. If $X$ is a random variable whose cumulative distribution function is
partly discrete and partly continuous, then (according to Subsec. 3.3)
$F_X(x) = (1 - p) F^{d}(x) + p F^{ac}(x)$ for some $0 < p < 1$. Now $\mathscr{E}[g(X)]$ can be defined to be
$\mathscr{E}[g(X)] = (1 - p) \sum_j g(x_j) f^{d}(x_j) + p \int_{-\infty}^{\infty} g(x) f^{ac}(x)\, dx$, where $f^{d}(\cdot)$ is the discrete density
function corresponding to $F^{d}(\cdot)$ and $f^{ac}(\cdot)$ is the probability density function
corresponding to $F^{ac}(\cdot)$.

Theorem 3 Below are properties of expected value:

(i) $\mathscr{E}[c] = c$ for a constant $c$.

(ii) $\mathscr{E}[c g(X)] = c \mathscr{E}[g(X)]$ for a constant $c$.

(iii) $\mathscr{E}[c_1 g_1(X) + c_2 g_2(X)] = c_1 \mathscr{E}[g_1(X)] + c_2 \mathscr{E}[g_2(X)]$.

(iv) $\mathscr{E}[g_1(X)] \le \mathscr{E}[g_2(X)]$ if $g_1(x) \le g_2(x)$ for all $x$.

proof Assume $X$ is continuous. To prove (i), take $g(x) = c$; then

$$\mathscr{E}[g(X)] = \mathscr{E}[c] = \int_{-\infty}^{\infty} c f_X(x)\, dx = c \int_{-\infty}^{\infty} f_X(x)\, dx = c.$$

$$\mathscr{E}[c g(X)] = \int_{-\infty}^{\infty} c g(x) f_X(x)\, dx = c \int_{-\infty}^{\infty} g(x) f_X(x)\, dx = c \mathscr{E}[g(X)],$$

which proves (ii). (iii) is given by

$$\mathscr{E}[c_1 g_1(X) + c_2 g_2(X)] = \int_{-\infty}^{\infty} [c_1 g_1(x) + c_2 g_2(x)] f_X(x)\, dx$$
$$= c_1 \int_{-\infty}^{\infty} g_1(x) f_X(x)\, dx + c_2 \int_{-\infty}^{\infty} g_2(x) f_X(x)\, dx$$
$$= c_1 \mathscr{E}[g_1(X)] + c_2 \mathscr{E}[g_2(X)].$$

Finally,

$$0 \le \mathscr{E}[g_2(X) - g_1(X)] = \mathscr{E}[g_2(X)] - \mathscr{E}[g_1(X)],$$

which gives (iv).

Similar proofs could be presented for the discrete random variable
case. ////

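Theorem 4 If $X$ is a random variable, $\operatorname{var}[X] = \mathscr{E}[(X - \mathscr{E}[X])^2] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2$
provided $\mathscr{E}[X^2]$ exists.

proof (We first note that if $\mathscr{E}[X^2]$ exists, then $\mathscr{E}[X]$ exists.)* By
our definitions of variance and $\mathscr{E}[g(X)]$, it follows that $\operatorname{var}[X] = \mathscr{E}[(X - \mathscr{E}[X])^2]$.
Now $\mathscr{E}[(X - \mathscr{E}[X])^2] = \mathscr{E}[X^2 - 2X\mathscr{E}[X] + (\mathscr{E}[X])^2] = \mathscr{E}[X^2] - 2(\mathscr{E}[X])^2 + (\mathscr{E}[X])^2 = \mathscr{E}[X^2] - (\mathscr{E}[X])^2$. ////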
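
The two expressions for the variance in Theorem 4 can be compared on a small discrete case. The sketch below uses the absolute difference of two fair dice as an assumed example; `expect` is a hypothetical helper that applies the discrete form of the definition of expected value.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Discrete density of the absolute difference of two fair dice.
counts = Counter(abs(a - b) for a, b in product(range(1, 7), repeat=2))
f_Y = {y: Fraction(c, 36) for y, c in counts.items()}

def expect(g):
    """Sum of g(y_j) * f_Y(y_j) over the mass points."""
    return sum(g(y) * p for y, p in f_Y.items())

mu = expect(lambda y: y)
var_by_definition = expect(lambda y: (y - mu) ** 2)   # E[(Y - E[Y])^2]
var_by_shortcut = expect(lambda y: y * y) - mu ** 2   # E[Y^2] - (E[Y])^2

print(var_by_definition, var_by_shortcut)  # 665/324 665/324
```

Exact fractions make the agreement an identity rather than an approximate match.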

The above theorem provides us with two methods of calculating a variance,
namely $\mathscr{E}[(X - \mu_X)^2]$ or $\mathscr{E}[X^2] - \mu_X^2$. Note that both methods require $\mu_X$.


* Here and in the future we are not going to concern ourselves with checking existence.

$\mathscr{E}[g(X)]$ is used in each of the following three subsections. In Subsecs. 4.4
and 4.5 two inequalities involving $\mathscr{E}[g(X)]$ are given. Definitions and examples
of $\mathscr{E}[g(X)]$ for particular functions $g(\cdot)$ are given in Subsec. 4.6.

4.4 Chebyshev Inequality 

Theorem 5 Let $X$ be a random variable and $g(\cdot)$ a nonnegative function
with domain the real line; then

$$P[g(X) \ge k] \le \frac{\mathscr{E}[g(X)]}{k} \quad \text{for every } k > 0. \qquad (12)$$

proof Assume that $X$ is a continuous random variable with
probability density function $f_X(\cdot)$; then

$$\mathscr{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx = \int_{\{x : g(x) \ge k\}} g(x) f_X(x)\, dx + \int_{\{x : g(x) < k\}} g(x) f_X(x)\, dx$$
$$\ge \int_{\{x : g(x) \ge k\}} g(x) f_X(x)\, dx \ge \int_{\{x : g(x) \ge k\}} k f_X(x)\, dx = k P[g(X) \ge k].$$

Divide by $k$, and the result follows. A similar proof holds for $X$ discrete.

////

Corollary Chebyshev inequality If $X$ is a random variable with finite
variance,

$$P[|X - \mu_X| \ge r\sigma_X] = P[(X - \mu_X)^2 \ge r^2 \sigma_X^2] \le \frac{1}{r^2} \quad \text{for every } r > 0. \qquad (13)$$

proof Take $g(x) = (x - \mu_X)^2$ and $k = r^2 \sigma_X^2$ in Eq. (12) of Theorem 5. ////

Remark If $X$ is a random variable with finite variance,

$$P[|X - \mu_X| < r\sigma_X] \ge 1 - \frac{1}{r^2}, \qquad (14)$$

which is just a rewriting of Eq. (13). ////

The Chebyshev inequality is used in various ways. We will use it later to
prove the law of large numbers. Note what Eq. (14) says:

$$P[\mu_X - r\sigma_X < X < \mu_X + r\sigma_X] \ge 1 - \frac{1}{r^2};$$

that is, the probability that $X$ falls within $r\sigma_X$ units of $\mu_X$ is greater than or
equal to $1 - 1/r^2$. For $r = 2$, one gets $P[\mu_X - 2\sigma_X < X < \mu_X + 2\sigma_X] \ge \tfrac{3}{4}$, or
for any random variable $X$ having finite variance at least three-fourths of the
mass of $X$ falls within two standard deviations of its mean.

Ordinarily, to calculate the probability of an event described in terms of 
a random variable X, the distribution or density of X is needed; the Chebyshev 
inequality gives a bound, which does not depend on the distribution of X, for the 
probability of particular events described in terms of a random variable and 
its mean and variance. 
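
To see how conservative the bound can be, one may compare it with an exact tail probability for a particular distribution. The sketch below assumes, purely for illustration, an exponential random variable with $\lambda = 1$ (so that its mean and standard deviation are both 1); `exact_tail` is a hypothetical helper built from that distribution's cumulative distribution function.

```python
import math

def exact_tail(r):
    """P[|X - 1| >= r] for X exponential with lambda = 1 (mean 1, sd 1)."""
    upper = math.exp(-(1 + r))                        # P[X >= 1 + r]
    lower = 1 - math.exp(-(1 - r)) if r < 1 else 0.0  # P[X <= 1 - r]
    return upper + lower

for r in (1.5, 2, 3):
    # Exact probability on the left, Chebyshev bound 1/r^2 on the right.
    print(r, round(exact_tail(r), 4), "<=", round(1 / r ** 2, 4))
```

For this distribution the exact probabilities fall well below $1/r^2$, which is typical of a bound that must hold for every distribution with finite variance.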


4.5 Jensen Inequality 

Definition 14 Convex function A continuous function $g(\cdot)$ with domain
and counterdomain the real line is called convex if for every $x_0$ on the
real line, there exists a line which goes through the point $(x_0, g(x_0))$ and
lies on or under the graph of the function $g(\cdot)$. ////

Theorem 6 Jensen inequality Let $X$ be a random variable with mean
$\mathscr{E}[X]$, and let $g(\cdot)$ be a convex function; then $\mathscr{E}[g(X)] \ge g(\mathscr{E}[X])$.

proof Since $g(x)$ is continuous and convex, there exists a line, say
$l(x) = a + bx$, satisfying $l(x) = a + bx \le g(x)$ and $l(\mathscr{E}[X]) = g(\mathscr{E}[X])$.
$l(x)$ is a line given by the definition of continuous and convex that goes
through the point $(\mathscr{E}[X], g(\mathscr{E}[X]))$. Note that $\mathscr{E}[l(X)] = \mathscr{E}[(a + bX)] = a + b\mathscr{E}[X] = l(\mathscr{E}[X])$;
hence $g(\mathscr{E}[X]) = l(\mathscr{E}[X]) = \mathscr{E}[l(X)] \le \mathscr{E}[g(X)]$ [using
property (iv) of expected values (see Theorem 3) for the last inequality]. ////

The Jensen inequality can be used to prove the Rao-Blackwell theorem to
appear in Chap. VII. We point out that, in general, $\mathscr{E}[g(X)] \neq g(\mathscr{E}[X])$; for
example, note that $g(x) = x^2$ is convex; hence $\mathscr{E}[X^2] \ge (\mathscr{E}[X])^2$, which says
that the variance of $X$, which is $\mathscr{E}[X^2] - (\mathscr{E}[X])^2$, is nonnegative.
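
The case $g(x) = x^2$ can be checked on a concrete distribution. The sketch below assumes a single fair die as the random variable; `expect` is a hypothetical helper, and the gap between the two sides is the variance of the die.

```python
from fractions import Fraction

# Discrete density of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g):
    """Sum of g(x_j) * f(x_j) over the mass points."""
    return sum(g(x) * p for x, p in pmf.items())

def g(x):
    """A convex function: g(x) = x^2."""
    return x * x

lhs = expect(g)                # expected value of g(X)
rhs = g(expect(lambda x: x))   # g evaluated at the mean of X

print(lhs, rhs, lhs - rhs)     # 91/6 49/4 35/12
```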


4.6 Moments and Moment Generating Functions 

The moments (or raw moments) of a random variable or of a distribution are
the expectations of the powers of the random variable which has the given
distribution.

Definition 15 Moments If $X$ is a random variable, the $r$th moment of
$X$, usually denoted by $\mu'_r$, is defined as

$$\mu'_r = \mathscr{E}[X^r] \qquad (15)$$

if the expectation exists. ////

Note that $\mu'_1 = \mathscr{E}[X] = \mu_X$, the mean of $X$.

Definition 16 Central moments If $X$ is a random variable, the $r$th
central moment of $X$ about $a$ is defined as $\mathscr{E}[(X - a)^r]$. If $a = \mu_X$, we
have the $r$th central moment of $X$ about $\mu_X$, denoted by $\mu_r$, which is

$$\mu_r = \mathscr{E}[(X - \mu_X)^r]. \qquad (16)$$

////

Note that $\mu_1 = \mathscr{E}[(X - \mu_X)] = 0$ and $\mu_2 = \mathscr{E}[(X - \mu_X)^2]$, the variance of $X$.
Also, note that all odd moments of $X$ about $\mu_X$ are 0 if the density function of $X$
is symmetrical about $\mu_X$, provided such moments exist.

In the ensuing few paragraphs we will comment on how the first four 
moments of a random variable or density are used as measures of various 
characteristics of the corresponding density. For some of these characteristics, 
other measures can be defined in terms of quantiles. 

Definition 17 Quantile The $q$th quantile of a random variable $X$ or of
its corresponding distribution is denoted by $\xi_q$ and is defined as the
smallest number $\xi$ satisfying $F_X(\xi) \ge q$. ////

If $X$ is a continuous random variable, then the $q$th quantile of $X$ is given as
the smallest number $\xi$ satisfying $F_X(\xi) = q$. See Fig. 6.
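
For a discrete random variable the cumulative distribution function is a step function, so the smallest $\xi$ with $F_X(\xi) \ge q$ is one of the mass points and can be found by accumulating mass in increasing order. The following is a minimal sketch under that observation, with a fair die as an assumed example; the function name `quantile` is hypothetical.

```python
from fractions import Fraction

def quantile(pmf, q):
    """Smallest mass point xi with F(xi) >= q, for 0 < q <= 1.

    pmf maps mass points to their probabilities.
    """
    total = Fraction(0)
    for x in sorted(pmf):
        total += pmf[x]       # total is now F(x)
        if total >= q:
            return x
    raise ValueError("q must satisfy 0 < q <= 1")

die = {x: Fraction(1, 6) for x in range(1, 7)}

print(quantile(die, Fraction(1, 4)),
      quantile(die, Fraction(1, 2)),
      quantile(die, Fraction(3, 4)))  # 2 3 5
```

Passing $q$ as an exact fraction avoids floating-point ties at the steps of the cumulative distribution function.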

Definition 18 Median The median of a random variable $X$, denoted by
$\operatorname{med}_X$, $\operatorname{med}(X)$, or $\xi_{.50}$, is the .5th quantile. ////

Remark In some texts the median of $X$ is alternatively defined as any
number, say $\operatorname{med}(X)$, satisfying $P[X \le \operatorname{med}(X)] \ge \tfrac{1}{2}$ and $P[X \ge \operatorname{med}(X)] \ge \tfrac{1}{2}$. ////

If $X$ is a continuous random variable, then the median of $X$ satisfies

$$\int_{-\infty}^{\operatorname{med}(X)} f_X(x)\, dx = \tfrac{1}{2} = \int_{\operatorname{med}(X)}^{\infty} f_X(x)\, dx;$$

so the median of $X$ is any number that has half the mass of $X$ to its right and
the other half to its left, which justifies use of the word “median.”

FIGURE 6

We have already mentioned that $\mathscr{E}[X]$, the first moment, locates the “center”
of the density of $X$. The median of $X$ is also used to indicate a central
location of the density of $X$. A third measure of location of the density of $X$,
though not necessarily a measure of central location, is the mode of $X$, which is
defined as that point (if such a point exists) at which $f_X(\cdot)$ attains its maximum.
Other measures of location [for example, $\tfrac{1}{2}(\xi_{.25} + \xi_{.75})$] could be devised, but
three, mean, median, and mode, are the ones commonly used.

We previously mentioned that the second moment about the mean, the
variance of a distribution, measures the spread or dispersion of a distribution.
Let us look a little further into the manner in which the variance characterizes
the distribution. Suppose that $f_1(x)$ and $f_2(x)$ are two densities with the same
mean $\mu$ such that

$$\int_{\mu - a}^{\mu + a} [f_1(x) - f_2(x)]\, dx \ge 0 \qquad (17)$$

for every value of $a$. Two such densities are illustrated in Fig. 7. It can be
shown that in this case the variance $\sigma_1^2$ in the first density is smaller than the
variance $\sigma_2^2$ in the second density.

FIGURE 8

We shall not take the time to prove this in
detail, but the argument is roughly this: Let

$$g(x) = f_1(x) - f_2(x),$$

where $f_1(x)$ and $f_2(x)$ satisfy Eq. (17). Since $\int_{-\infty}^{\infty} g(x)\, dx = 0$, the positive area
between $g(x)$ and the $x$ axis is equal to the negative area. Furthermore, in
view of Eq. (17), every positive element of area $g(x')\, dx'$ may be balanced by a
negative element $g(x'')\, dx''$ in such a way that $x''$ is further from $\mu$ than $x'$.
When these elements of area are multiplied by $(x - \mu)^2$, the negative elements
will be multiplied by larger factors than their corresponding positive elements
(see Fig. 8); hence

$$\int_{-\infty}^{\infty} (x - \mu)^2 g(x)\, dx < 0$$

unless $f_1(x)$ and $f_2(x)$ are equal. Thus it follows that $\sigma_1^2 < \sigma_2^2$. The converse
of these statements is not true. That is, if one is told that $\sigma_1^2 < \sigma_2^2$, one cannot
conclude that the corresponding densities satisfy Eq. (17) for all values of $a$,
although it can be shown that Eq. (17) must be true for certain values of $a$.
Thus the condition $\sigma_1^2 < \sigma_2^2$ does not give one any precise information about
the nature of the corresponding distributions, but it is evident that $f_1(x)$ has
more area near the mean than $f_2(x)$, at least for certain intervals about the mean.

We indicated above how variance is used as a measure of spread or 
dispersion of a distribution. Alternative measures of dispersion can be defined 
in terms of quantiles. For example, $\xi_{.75} - \xi_{.25}$, called the interquartile range,
is a measure of spread. Also, $\xi_p - \xi_{1-p}$ for some $\frac{1}{2} < p < 1$ is a possible
measure of spread.

The third moment $\mu_3$ about the mean is sometimes called a measure of
asymmetry, or skewness. Symmetrical distributions like those in Fig. 9 can be
shown to have $\mu_3 = 0$. A curve shaped like $f_1(x)$ in Fig. 10 is said to be skewed
to the left and can be shown to have a negative third moment about the mean;
one shaped like $f_2(x)$ is called skewed to the right and can be shown to have a
positive third moment about the mean. Actually, however, knowledge of the
third moment gives almost no clue as to the shape of the distribution, and we
mention it mainly to point out that fact. Thus, for example, the density $f_3(x)$
in Fig. 10 has $\mu_3 = 0$, but it is far from symmetrical. By changing the curve
slightly we could give it either a positive or negative third moment. The ratio
$\mu_3/\sigma^3$, which is unitless, is called the coefficient of skewness.

The quantity $\eta$ = (mean − median)/(standard deviation) provides an
alternative measure of skewness. It can be proved that $-1 \le \eta \le 1$.

The fourth moment about the mean is sometimes used as a measure of 
excess or kurtosis, which is the degree of flatness of a density near its center. 
Positive values of $\mu_4/\sigma^4 - 3$, called the coefficient of excess or kurtosis, are
sometimes used to indicate that a density is more peaked around its center than
the density of a normal curve (see Subsec. 3.2 of Chap. III), and negative values
are sometimes used to indicate that a density is more flat around its center than 
the density of a normal curve. This measure, however, suffers from the same 
failing as does the measure of skewness; namely, it does not always measure 
what it is supposed to. 

While a particular moment or a few of the moments may give little
information about a distribution (see Fig. 11 for a sketch of two densities having
the same first four moments; see also Ref. 40 and Prob. 30 in Chap. III),
the entire set of moments ($\mu'_1, \mu'_2, \mu'_3, \ldots$) will ordinarily determine the
distribution exactly, and for this reason we shall have occasion to use the moments
in theoretical work.

FIGURE 10

In applied statistics, the first two moments are of great importance, as 
we shall see, but the third and higher moments are rarely useful. Ordinarily 
one does not know what distribution function one is working with in a practical 
problem, and often it makes little difference what the actual shape of the distri¬ 
bution is. But it is usually necessary to know at least the location of the 
distribution and to have some idea of its dispersion. These characteristics can 
be estimated by examining a sample drawn from a set of objects known to have 
the distribution in question. This estimation problem is probably the most 
important problem in applied statistics, and a large part of this book will be 
devoted to a study of it. 

We now define another kind of moment, the factorial moment.

Definition 19 Factorial moment If X is a random variable, the rth
factorial moment of X is defined as (r is a positive integer):

$$\mathscr{E}[X(X - 1) \cdots (X - r + 1)]. \qquad (18)$$

////

For some random variables (usually discrete), factorial moments are 
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easier to calculate than raw moments. However the raw moments can be 
obtained from the factorial moments and vice versa. 
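
The conversion between the two kinds of moments can be made concrete. The sketch below is illustrative only: it assumes a fair six-sided die as the discrete density (not an example from the text) and uses the standard identity $x^r = \sum_k S(r, k)\,x(x-1)\cdots(x-k+1)$, where $S(r, k)$ is a Stirling number of the second kind; the function names are hypothetical.

```python
from fractions import Fraction


def stirling2(r, k):
    """Stirling number of the second kind S(r, k), by the standard recurrence."""
    if r == k:
        return 1
    if k == 0 or k > r:
        return 0
    return k * stirling2(r - 1, k) + stirling2(r - 1, k - 1)


def falling_product(x, k):
    """x(x - 1)...(x - k + 1); the empty product (k = 0) is 1."""
    out = 1
    for i in range(k):
        out *= x - i
    return out


def factorial_moment(pmf, k):
    """k-th factorial moment of a discrete density given as {value: probability}."""
    return sum(p * falling_product(x, k) for x, p in pmf.items())


def raw_moment(pmf, r):
    """r-th raw moment E[X^r], computed directly from the density."""
    return sum(p * x**r for x, p in pmf.items())


def raw_from_factorial(pmf, r):
    """r-th raw moment rebuilt from the factorial moments of order 0..r."""
    return sum(stirling2(r, k) * factorial_moment(pmf, k) for k in range(r + 1))


# Illustrative density (an assumption, not from the text): a fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}
for r in range(1, 5):
    print(r, raw_moment(die, r), raw_from_factorial(die, r))
```

For r = 2 the identity reduces to the familiar relation that the second raw moment equals the second factorial moment plus the mean.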

The moments of a density function play an important role in theoretical 
and applied statistics. In fact, in some cases, if all the moments are known, 
the density can be determined. This will be discussed briefly at the end of 
this subsection. Since the moments of a density are important, it would be 
useful if a function could be found that would give us a representation of all 
the moments. Such a function is called a moment generating function. 

Definition 20 Moment generating function Let X be a random variable
with density $f_X(\cdot)$. The expected value of $e^{tX}$ is defined to be the
moment generating function of X if the expected value exists for every
value of t in some interval $-h < t < h$, $h > 0$. The moment generating
function, denoted by $m_X(t)$ or $m(t)$, is

$$m(t) = \mathscr{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx \qquad (19)$$

if the random variable X is continuous and is

$$m(t) = \mathscr{E}[e^{tX}] = \sum_{x} e^{tx} f_X(x)$$

if the random variable is discrete. ////

One might note that a moment generating function is defined in terms of 
a density function, and since density functions were defined without reference 
to random variables (see Definitions 6 and 9), a moment generating function 
can be discussed without reference to random variables. 

If a moment generating function exists, then m(t) is continuously
differentiable in some neighborhood of the origin. If we differentiate the moment
generating function r times with respect to t, we have

$$\frac{d^r}{dt^r} m(t) = \int_{-\infty}^{\infty} x^r e^{xt} f_X(x)\,dx, \qquad (20)$$

and letting $t \to 0$, we find

$$\frac{d^r}{dt^r} m(0) = \mathscr{E}[X^r] = \mu'_r, \qquad (21)$$

where the symbol on the left is to be interpreted to mean the rth derivative of
m(t) evaluated as $t \to 0$. Thus the moments of a distribution may be obtained
from the moment generating function by differentiation, hence its name.

If in Eq. (19) we replace $e^{xt}$ by its series expansion, we obtain the series
expansion of m(t) in terms of the moments of $f_X(\cdot)$; thus

$$m(t) = \mathscr{E}\left[1 + Xt + \frac{1}{2!}(Xt)^2 + \frac{1}{3!}(Xt)^3 + \cdots\right]
      = 1 + \mu'_1 t + \frac{1}{2!}\mu'_2 t^2 + \cdots
      = \sum_{j=0}^{\infty} \frac{1}{j!}\,\mu'_j t^j, \qquad (22)$$

from which it is again evident that $\mu'_r$ may be obtained from m(t); $\mu'_r$ is the
coefficient of $t^r/r!$.


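EXAMPLE 17 Let X be a random variable with probability density function
given by $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$.

$$m_X(t) = \mathscr{E}[e^{tX}] = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda.$$

$$m'(t) = \frac{dm(t)}{dt} = \frac{\lambda}{(\lambda - t)^2}; \quad \text{hence } m'(0) = \mathscr{E}[X] = \frac{1}{\lambda}.$$

$$\text{And } m''(t) = \frac{2\lambda}{(\lambda - t)^3}, \quad \text{so } m''(0) = \mathscr{E}[X^2] = \frac{2}{\lambda^2}. \qquad ////$$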
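
The differentiation in Example 17 can be checked numerically. The following is a rough sketch, not part of the text's derivation: it approximates the first two derivatives of $m(t) = \lambda/(\lambda - t)$ at $t = 0$ by central finite differences. The value $\lambda = 2$ and the step sizes are arbitrary assumptions, and the function names are hypothetical.

```python
def mgf_exponential(t, lam):
    """m(t) = lam / (lam - t), the m.g.f. of Example 17, valid only for t < lam."""
    if t >= lam:
        raise ValueError("the m.g.f. exists only for t < lam")
    return lam / (lam - t)


def first_derivative_at_zero(m, h=1e-4):
    """Central-difference approximation to m'(0)."""
    return (m(h) - m(-h)) / (2 * h)


def second_derivative_at_zero(m, h=1e-3):
    """Central-difference approximation to m''(0)."""
    return (m(h) - 2 * m(0.0) + m(-h)) / (h * h)


lam = 2.0  # illustrative rate parameter (an assumption)
m = lambda t: mgf_exponential(t, lam)

mean_estimate = first_derivative_at_zero(m)              # compare with 1/lam
second_moment_estimate = second_derivative_at_zero(m)    # compare with 2/lam**2
print(mean_estimate, 1 / lam)
print(second_moment_estimate, 2 / lam**2)
```

The estimates agree with $1/\lambda$ and $2/\lambda^2$ only up to discretization and rounding error, so this is a plausibility check rather than a proof.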


EXAMPLE 18 Consider the random variable X having probability density
function $f_X(x) = x^{-2} I_{[1,\infty)}(x)$. (See Example 13.) If the moment
generating function of X exists, then it is given by $\int_1^{\infty} x^{-2} e^{tx}\,dx$. It can be
shown, however, that the integral does not exist for any t > 0, and hence
the moment generating function does not exist for this random variable X.

////

As with moments, there is also a generating function for factorial moments. 

Definition 21 Factorial moment generating function Let X be a random
variable. The factorial moment generating function is defined as
$\mathscr{E}[t^X]$ if this expectation exists. ////

The factorial moment generating function is used to generate factorial
moments in the same way as the raw moments are obtained from $\mathscr{E}[e^{tX}]$ except
that t approaches 1 instead of 0. It sometimes simplifies finding moments
of discrete distributions.
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EXAMPLE 19 Suppose X has a discrete density function given by 

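$$f_X(x) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots.$$

Then

$$\mathscr{E}[t^X] = \sum_{x=0}^{\infty} \frac{t^x e^{-\lambda} \lambda^x}{x!}
                 = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda t)^x}{x!}
                 = e^{-\lambda + \lambda t};$$

hence

$$\frac{d}{dt}\,\mathscr{E}[t^X]\Big|_{t=1} = \lambda e^{-\lambda + \lambda t}\Big|_{t=1} = \lambda.$$

////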


In addition to raw moments, central moments, and factorial moments, 
there are other kinds of moments, called cumulants, or semi-invariants. Cumu- 
lants will be defined in terms of the cumulant generating function. We will not 
make use of cumulants in this book. 


Definition 22 Cumulant and cumulant generating function The logarithm 
of the moment generating function of X is defined to be the cumulant 
generating function of X. The rth cumulant of X, denoted by $\kappa_r(X)$ or $\kappa_r$,
is the coefficient of $t^r/r!$ in the Taylor series expansion of the cumulant
generating function. //// 
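
Because the cumulant generating function is the logarithm of the moment generating function, the cumulants can be computed from the raw moments. The sketch below uses the standard recursion $\kappa_n = \mu'_n - \sum_{k=1}^{n-1} \binom{n-1}{k-1} \kappa_k \mu'_{n-k}$, which is not stated in the text and is supplied here as an assumption; the Bernoulli-type moments used to exercise it are likewise only an illustration.

```python
from fractions import Fraction
from math import comb


def cumulants_from_raw_moments(raw):
    """Convert raw moments to cumulants.

    raw[r - 1] holds the r-th raw moment; the result holds
    kappa_1, ..., kappa_n in the same order.
    """
    kappa = []
    for n in range(1, len(raw) + 1):
        correction = sum(
            comb(n - 1, k - 1) * kappa[k - 1] * raw[n - k - 1]
            for k in range(1, n)
        )
        kappa.append(raw[n - 1] - correction)
    return kappa


# Illustration: a 0-1 valued variable with P[X = 1] = p has every raw moment equal to p.
p = Fraction(1, 3)
kappas = cumulants_from_raw_moments([p, p, p])
print(kappas)  # first cumulant is the mean, second is the variance
```

The first cumulant is the mean and the second is the variance, which gives a quick sanity check on any implementation.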

A moment generating function is used, as its name suggests, to generate 
moments. That, however, will not be its only use for us. An important use 
will be in determining distributions. 


Theorem 7 Let X and Y be two random variables with densities $f_X(\cdot)$
and $f_Y(\cdot)$, respectively. Suppose that $m_X(t)$ and $m_Y(t)$ both exist and are
equal for all t in the interval $-h < t < h$ for some $h > 0$. Then the two
cumulative distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$ are equal. ////

A proof of the above theorem can be obtained using certain transform 
theory that is beyond the scope of this book. We should note, however, what 
the theorem asserts. It says that if we can find the moment generating function 
of a random variable, then, theoretically, we can find the distribution of the 
random variable since there is a unique distribution function for a given moment 
generating function. This theorem will prove to be extremely useful in finding 
the distribution of certain functions of random variables. In particular, see 
Sec. 4 of Chap. V. 
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EXAMPLE 20 Suppose that a random variable X has a moment generating 
function $m_X(t) = 1/(1 - t)$ for $-1 < t < 1$; then we know that the
density of X is given by $f_X(x) = e^{-x} I_{[0,\infty)}(x)$ since we showed in Example
17 above that $\lambda e^{-\lambda x} I_{[0,\infty)}(x)$ has $\lambda/(\lambda - t)$ for its moment generating
function. ////


Problem of moments We have seen that a density function determines a set
of moments $\mu'_1, \mu'_2, \ldots$ when they exist. One of the important problems in
theoretical statistics is this: Given a set of moments, what is the density function
from which these moments came, and is there only one density function that
has these particular moments? We shall give only partial answers. First,
there exists a sequence of moments for which there is an infinite
(nondenumerable) collection of different distribution functions having these same moments.
In general, a sequence of moments $\mu'_1, \mu'_2, \ldots$ does not determine a unique
distribution function. However, we did see that if the moment generating 
function of a random variable did exist, then this moment generating function 
did uniquely determine the corresponding distribution function. (See Theorem 
7 above.) Hence, there are conditions (existence of the moment generating 
function is a sufficient condition) under which a sequence of moments does 
uniquely determine a distribution function. The general problem of whether or 
not a distribution function is determined by its sequence of moments is 
referred to as the problem of moments and will not be discussed further. 


PROBLEMS 


1 (a) Show that the following are probability density functions (p.d.f.'s):

$$f_1(x) = e^{-x} I_{(0,\infty)}(x)$$

$$f_2(x) = 2e^{-2x} I_{(0,\infty)}(x)$$

$$f(x) = (\theta + 1) f_1(x) - \theta f_2(x), \qquad 0 < \theta < 1.$$

(b) Prove or disprove: If $f_1(x)$ and $f_2(x)$ are p.d.f.'s and if $\theta_1 + \theta_2 = 1$, then
$\theta_1 f_1(x) + \theta_2 f_2(x)$ is a p.d.f.

2 Show that the following is a density function and find its median: 


a 2 (a + 2x) 

f(x) = x 2 (rz —X ) 1 + 


x(2a + x) 

/< 0 . .,(*), for a > 0 


3 Find the constant K so that the following is a p.d.f.

$$f(x) = Kx^2 I_{(-K, K)}(x).$$
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4 Suppose that the cumulative distribution function (c.d.f.) F x (x) can be written 
as a function of $(x - \alpha)/\beta$, where $\alpha$ and $\beta > 0$ are constants; that is, x, $\alpha$, and $\beta$
appear in $F_X(\cdot)$ only in the indicated form.

(a) Prove that if $\alpha$ is increased by $\Delta\alpha$, then so is the mean of X.

(b) Prove that if $\beta$ is multiplied by k (k > 0), then so is the standard deviation
of X.

5 The experiment is to toss two balls into four boxes in such a way that each ball 
is equally likely to fall in any box. Let X denote the number of balls in the first 
box. 

(a) What is the c.d.f. of X?

(b) What is the density function of X?

(c) Find the mean and variance of X.

6 A fair coin is tossed until a head appears. Let X denote the number of tosses 
required. 

(a) Find the density function of X. 

(b) Find the mean and variance of X.

(c) Find the moment generating function (m.g.f.) of X. 

*7 A has two pennies; B has one. They match pennies until one of them has all 
three. Let X denote the number of trials required to end the game. 

(a) What is the density function of X ? 

(b) Find the mean and variance of X. 

(c) What is the probability that B wins the game? 

8 Let $f_X(x) = (1/\beta)[1 - |(x - \alpha)/\beta|]\, I_{(\alpha - \beta,\, \alpha + \beta)}(x)$, where $\alpha$ and $\beta$ are fixed
constants satisfying $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Demonstrate that $f_X(\cdot)$ is a p.d.f., and sketch it.

(b) Find the c.d.f. corresponding to $f_X(\cdot)$.

(c) Find the mean and variance of X.

(d) Find the qth quantile of X.

9 Let $f_X(x) = k(1/\beta)\{1 - [(x - \alpha)/\beta]^2\}\, I_{(\alpha - \beta,\, \alpha + \beta)}(x)$, where $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Find k so that $f_X(\cdot)$ is a p.d.f., and sketch the p.d.f.

(b) Find the mean, median, and variance of X.

(c) Find $\mathscr{E}[|X - \alpha|]$.

(d) Find the qth quantile of X.

10 Let $f_X(x) = \frac{1}{2}\{\theta I_{(0,1)}(x) + I_{[1,2]}(x) + (1 - \theta) I_{(2,3)}(x)\}$, where $\theta$ is a fixed constant
satisfying $0 \le \theta \le 1$.

(a) Find the c.d.f. of X.

(b) Find the mean, median, and variance of X.

11 Let $f(x; \theta) = \theta f(x; 1) + (1 - \theta) f(x; 0)$, where $\theta$ is a fixed constant satisfying
$0 \le \theta \le 1$. Assume that $f(\cdot; 0)$ and $f(\cdot; 1)$ are both p.d.f.'s.

(a) Show that $f(\cdot; \theta)$ is also a p.d.f.

(b) Find the mean and variance of $f(\cdot; \theta)$ in terms of the mean and variance of
$f(\cdot; 0)$ and $f(\cdot; 1)$, respectively.

(c) Find the m.g.f. of $f(\cdot; \theta)$ in terms of the m.g.f.'s of $f(\cdot; 0)$ and $f(\cdot; 1)$.
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12 A bombing plane flies directly above a railroad track. Assume that if a large 
(small) bomb falls within 40 (15) feet of the track, the track will be sufficiently 
damaged so that traffic will be disrupted. Let X denote the perpendicular 
distance from the track that a bomb falls. Assume that 


$$f_X(x) = \frac{100 - x}{5000}\, I_{(0,100)}(x).$$

(a) Find the probability that a large bomb will disrupt traffic.

(b) If the plane can carry three large (eight small) bombs and uses all three
(eight), what is the probability that traffic will be disrupted?

13 (a) Let X be a random variable with mean $\mu$ and variance $\sigma^2$. Show that
$\mathscr{E}[(X - b)^2]$, as a function of b, is minimized when $b = \mu$.

*(b) Let X be a continuous random variable with median m. Minimize $\mathscr{E}[|X - b|]$
as a function of b. Hint: Show that $\mathscr{E}[|X - b|] = \mathscr{E}[|X - m|] + 2\int_b^m (x - b) f_X(x)\,dx$.

14 (a) If X is a random variable such that $\mathscr{E}[X] = 3$ and $\mathscr{E}[X^2] = 13$, use the
Chebyshev inequality to determine a lower bound for $P[-2 < X < 8]$.

(b) Let X be a discrete random variable with density

$$f_X(x) = \tfrac{1}{8} I_{\{-1\}}(x) + \tfrac{6}{8} I_{\{0\}}(x) + \tfrac{1}{8} I_{\{1\}}(x).$$

For k = 2 evaluate $P[|X - \mu_X| \ge k\sigma_X]$. (This shows that in general the
Chebyshev inequality cannot be improved.)

(c) If X is a random variable with $\mathscr{E}[X] = \mu$ satisfying $P[X \le 0] = 0$, show that
$P[X > 2\mu] \le \frac{1}{2}$.

15 Let X be a random variable with p.d.f. given by

$$f_X(x) = |1 - x|\, I_{[0,2]}(x).$$

Find the mean and variance of X.

16 Let X be a random variable having c.d.f.

$$F_X(x) = pH(x) + (1 - p)G(x),$$

where p is a fixed real number satisfying $0 < p < 1$,

$$H(x) = x I_{(0,1)}(x) + I_{[1,\infty)}(x),$$

and

$$G(x) = \tfrac{1}{2} x I_{(0,2]}(x) + I_{(2,\infty)}(x).$$

(a) Sketch $F_X(x)$ for $p = \frac{1}{2}$.

(b) Give a formula for the p.d.f. of X or the discrete density function of X,
whichever is appropriate.

(c) Evaluate $P[X \le \frac{1}{2} \mid X \le 1]$.

17 Does there exist a random variable X for which $P[\mu_X - 2\sigma_X < X < \mu_X + 2\sigma_X] = .6$?
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18 An urn contains balls numbered 1, 2, 3. First a ball is drawn from the urn, 
and then a fair coin is tossed the number of times as the number shown on the 
drawn ball. Find the expected number of heads. 

19 If X has distribution given by $P[X = 0] = P[X = 2] = p$ and $P[X = 1] = 1 - 2p$
for $0 < p < \frac{1}{2}$, for what p is the variance of X a maximum?

20 If X is a random variable for which $P[X \le 0] = 0$ and $\mathscr{E}[X] = \mu < \infty$, prove that
$P[X \le \mu t] \ge 1 - 1/t$ for every $t \ge 1$.

21 Given the c.d.f.

$$F_X(x) = \begin{cases}
0 & \text{for } x < 0 \\
x^2 + .2 & \text{for } 0 \le x < .5 \\
x & \text{for } .5 \le x < 1 \\
1 & \text{for } 1 \le x.
\end{cases}$$

(a) Express $F_X(x)$ in terms of indicator functions.

(b) Express $F_X(x)$ in the form

$$aF^{ac}(x) + bF^{d}(x),$$

where $F^{ac}(\cdot)$ is an absolutely continuous c.d.f. and $F^{d}(\cdot)$ is a discrete c.d.f.

(c) Find $P[.25 < X < .75]$.

(d) Find $P[.25 < X < .5]$.

22 Let f{x) = 1 - 0 . «,(x). 

(a) Find K such that $f(\cdot)$ is a density function.

(b) Find the corresponding c.d.f.

(c) Find P[X >1]. 

23 A coin is tossed four times. Let X denote the number of times a head is followed 
immediately by a tail. Find the distribution, mean, and variance of X. 

24 Let $f_X(x; \theta) = (\theta x + \frac{1}{2})\, I_{[-1,1]}(x)$, where $\theta$ is a constant.

(a) For what range of values of $\theta$ is $f_X(\cdot; \theta)$ a density function?

(b) Find the mean and median of X.

(c) For what values of $\theta$ is var [X] maximized?

25 Let X be a discrete random variable with the nonnegative integers as values.
Note that $\mathscr{E}[t^X] = \sum_{j=0}^{\infty} t^j P[X = j]$. Hence, $\mathscr{E}[t^X]$ is a probability generating function
of X, inasmuch as the coefficient of $t^j$ gives $P[X = j]$. Find $\mathscr{E}[t^X]$ for the random
variables of Probs. 6 and 7.
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SPECIAL PARAMETRIC FAMILIES OF 
UNIVARIATE DISTRIBUTIONS 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to present certain parametric families of univariate 
density functions that have standard names. A parametric family of density 
functions is a collection of density functions that is indexed by a quantity called 
a parameter. For example, let $f(x; \lambda) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$, where $\lambda > 0$; then for
each $\lambda > 0$, $f(\cdot; \lambda)$ is a probability density function. $\lambda$ is the parameter, and as
$\lambda$ ranges over the positive numbers, the collection $\{f(\cdot; \lambda) : \lambda > 0\}$ is a parametric
family of density functions.

The chapter consists of three main sections: parametric families of dis¬ 
crete densities are given in one; parametric families of probability density func¬ 
tions are given in another, and comments relating the two are given in the final 
section. For most of the families of distributions introduced, the means, 
variances, and moment generating functions are presented; also, a sketch of 
several representative members of a presented family is often included. A 
table summarizing results of Secs. 2 and 3 is given in Appendix B. 
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2 DISCRETE DISTRIBUTIONS 

In this section we list several parametric families of univariate discrete densities. 
Sketches of most are given; the mean and variance of each are derived, and usually 
examples of random experiments for which the defined parametric family 
might provide a realistic model are included. 

The parameter (or parameters) indexes the family of densities. For 
each family of densities that is presented, the values that the parameter can 
assume will be specified. There is no uniform notation for parameters; both 
Greek and Latin letters are used to designate them. 

2.1 Discrete Uniform Distribution 

Definition 1 Discrete uniform distribution Each member of the family
of discrete density functions

$$f(x) = f(x; N) = \begin{cases} \dfrac{1}{N} & \text{for } x = 1, 2, \ldots, N \\ 0 & \text{otherwise} \end{cases}
     = \frac{1}{N}\, I_{\{1, 2, \ldots, N\}}(x), \qquad (1)$$

where the parameter N ranges over the positive integers, is defined to have
a discrete uniform distribution. A random variable X having a density
given in Eq. (1) is called a discrete uniform random variable. ////



FIGURE 1 

Density of discrete uniform. 


Theorem 1 If X has a discrete uniform distribution, then

$$\mathscr{E}[X] = \frac{N + 1}{2}, \quad \text{var}[X] = \frac{N^2 - 1}{12}, \quad \text{and} \quad m_X(t) = \mathscr{E}[e^{tX}] = \sum_{j=1}^{N} e^{jt}\,\frac{1}{N}.$$
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PROOF 


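$$\mathscr{E}[X] = \sum_{j=1}^{N} j\,\frac{1}{N} = \frac{N + 1}{2}.$$

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \sum_{j=1}^{N} j^2\,\frac{1}{N} - \left(\frac{N + 1}{2}\right)^2
= \frac{N(N + 1)(2N + 1)}{6N} - \frac{(N + 1)^2}{4} = \frac{(N + 1)(N - 1)}{12}.$$

$$\mathscr{E}[e^{tX}] = \sum_{j=1}^{N} e^{jt}\,\frac{1}{N}.$$

////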
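
The closed forms in Theorem 1 can be confirmed by direct summation. The sketch below is only a check, not a proof: it uses exact rational arithmetic for a few values of N (chosen arbitrarily), and the function name is hypothetical.

```python
from fractions import Fraction


def discrete_uniform_mean_variance(N):
    """Mean and variance of the discrete uniform density on {1, ..., N}, by direct summation."""
    weight = Fraction(1, N)
    mean = sum(j * weight for j in range(1, N + 1))
    second_moment = sum(j * j * weight for j in range(1, N + 1))
    return mean, second_moment - mean**2


for N in (1, 2, 6, 10):
    mean, variance = discrete_uniform_mean_variance(N)
    # Compare with (N + 1)/2 and (N^2 - 1)/12 from Theorem 1.
    print(N, mean, Fraction(N + 1, 2), variance, Fraction(N * N - 1, 12))
```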


Remark The discrete uniform distribution is sometimes defined in
density form as $f(x; N) = [1/(N + 1)]\, I_{\{0, 1, \ldots, N\}}(x)$, for N a nonnegative
integer. If such is the case, the formulas for the mean and variance have
to be modified accordingly. ////


2.2 Bernoulli and Binomial Distributions 

Definition 2 Bernoulli distribution A random variable X is defined
to have a Bernoulli distribution if the discrete density function of X is
given by

$$f_X(x) = f_X(x; p) = \begin{cases} p^x (1 - p)^{1 - x} & \text{for } x = 0 \text{ or } 1 \\ 0 & \text{otherwise} \end{cases}
       = p^x (1 - p)^{1 - x}\, I_{\{0, 1\}}(x), \qquad (2)$$

where the parameter p satisfies $0 \le p \le 1$. $1 - p$ is often denoted by q.

////

FIGURE 2
Bernoulli density.
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Theorem 2 If X has a Bernoulli distribution, then 

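$$\mathscr{E}[X] = p, \quad \text{var}[X] = pq, \quad \text{and} \quad m_X(t) = pe^t + q. \qquad (3)$$

PROOF $\mathscr{E}[X] = 0 \cdot q + 1 \cdot p = p$.

$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = 0^2 \cdot q + 1^2 \cdot p - p^2 = pq$.

$m_X(t) = \mathscr{E}[e^{tX}] = q + pe^t$. ////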

EXAMPLE 1 A random experiment whose outcomes have been classified
into two categories, called "success" and "failure," represented by the
letters s and f, respectively, is called a Bernoulli trial. If a random
variable X is defined as 1 if a Bernoulli trial results in success and 0 if
the same Bernoulli trial results in failure, then X has a Bernoulli
distribution with parameter p = P[success]. ////


EXAMPLE 2 For a given arbitrary probability space $(\Omega, \mathscr{A}, P[\cdot])$ and for A
belonging to $\mathscr{A}$, define the random variable X to be the indicator function
of A; that is, $X(\omega) = I_A(\omega)$; then X has a Bernoulli distribution with
parameter $p = P[X = 1] = P[A]$. ////


Definition 3 Binomial distribution A random variable X is defined to
have a binomial distribution if the discrete density function of X is given
by

$$f_X(x) = f_X(x; n, p) = \begin{cases} \dbinom{n}{x} p^x q^{n - x} & \text{for } x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}
       = \binom{n}{x} p^x q^{n - x}\, I_{\{0, 1, \ldots, n\}}(x), \qquad (4)$$

where the two parameters n and p satisfy $0 \le p \le 1$, n ranges over the
positive integers, and $q = 1 - p$. A distribution defined by the density
function given in Eq. (4) is called a binomial distribution. ////

FIGURE 3
Binomial densities.

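Theorem 3 If X has a binomial distribution, then

$$\mathscr{E}[X] = np, \quad \text{var}[X] = npq, \quad \text{and} \quad m_X(t) = (q + pe^t)^n. \qquad (5)$$

PROOF

$$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x q^{n - x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n - x} = (pe^t + q)^n.$$

Now

$$m'_X(t) = npe^t (pe^t + q)^{n - 1}$$

and

$$m''_X(t) = n(n - 1)(pe^t)^2 (pe^t + q)^{n - 2} + npe^t (pe^t + q)^{n - 1};$$

hence

$$\mathscr{E}[X] = m'_X(0) = np$$

and

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m''_X(0) - (np)^2 = n(n - 1)p^2 + np - (np)^2 = np(1 - p). \qquad ////$$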
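
As a cross-check on Theorem 3, the mean and variance can also be computed directly from the density in Eq. (4) rather than from the moment generating function. The sketch below does this with exact rational arithmetic; the parameter values n = 10 and p = 3/10 are arbitrary assumptions, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_density(n, p):
    """List of f_X(x; n, p) = C(n, x) p^x q^(n - x) for x = 0, 1, ..., n."""
    q = 1 - p
    return [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]


def mean_and_variance(density):
    """Mean and variance of a density listed over the values 0, 1, 2, ..."""
    mean = sum(x * fx for x, fx in enumerate(density))
    second_moment = sum(x * x * fx for x, fx in enumerate(density))
    return mean, second_moment - mean**2


n, p = 10, Fraction(3, 10)  # illustrative parameters (an assumption)
density = binomial_density(n, p)
mean, variance = mean_and_variance(density)
print(sum(density), mean, variance)  # compare with 1, n*p, and n*p*(1 - p)
```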

Remark The binomial distribution reduces to the Bernoulli distribution
when n = 1. Sometimes the Bernoulli distribution is called the point
binomial. ////


EXAMPLE 3 Consider a random experiment consisting of n repeated
independent Bernoulli trials when p is the probability of success s at each
individual trial. The term "repeated" is used to indicate that the
probability of s remains the same from trial to trial. The sample space for
such a random experiment can be represented as follows:

$$\Omega = \{(z_1, z_2, \ldots, z_n) : z_i = s \text{ or } z_i = f\}.$$

$z_i$ indicates the result of the ith trial. Since the trials are independent,
the probability of any specified outcome, say {(f, f, s, f, s, s, ..., f, s)},
is given by $qqpqpp \cdots qp$. Let the random variable X represent the
number of successes in the n repeated independent Bernoulli trials. Now
P[X = x] = P[exactly x successes and n − x failures in n trials] =
$\binom{n}{x} p^x q^{n - x}$ for x = 0, 1, ..., n since each outcome of the experiment that has
exactly x successes has probability $p^x q^{n - x}$ and there are $\binom{n}{x}$ such outcomes.
Hence X has a binomial distribution. ////


EXAMPLE 4 Consider sampling with replacement from an urn containing
M balls, K of which are defective. Let X represent the number of
defective balls in a sample of size n. The individual draws are Bernoulli trials
where "defective" corresponds to "success," and the experiment of
taking a sample of size n with replacement consists of n repeated
independent Bernoulli trials where p = P[success] = K/M; so X has the
binomial distribution

$$\binom{n}{x} \left(\frac{K}{M}\right)^x \left(1 - \frac{K}{M}\right)^{n - x}, \qquad (6)$$

which is the same as $P[A_k]$ in Eq. (3) of Subsec. 3.5 of Chap. I, for x = k.

////


The sketches in Fig. 3 seem to indicate that the terms f x (x; n, p) increase 
monotonically and then decrease monotonically. The following theorem states 
that such is indeed the case. 


Theorem 4 Let X have a binomial distribution with density $f_X(x; n, p)$;
then $f_X(x - 1; n, p) < f_X(x; n, p)$ for $x < (n + 1)p$; $f_X(x - 1; n, p) >
f_X(x; n, p)$ for $x > (n + 1)p$, and $f_X(x - 1; n, p) = f_X(x; n, p)$ if $x = (n + 1)p$
and $(n + 1)p$ is an integer, where x ranges over 1, ..., n.

PROOF

$$\frac{f_X(x; n, p)}{f_X(x - 1; n, p)} = \frac{n - x + 1}{x} \cdot \frac{p}{q} = 1 + \frac{(n + 1)p - x}{xq},$$

which is greater than 1 if $x < (n + 1)p$, smaller than 1 if $x > (n + 1)p$,
and equal to 1 if the integer x should equal $(n + 1)p$. ////
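
Theorem 4 can be spot-checked by comparing consecutive terms of the density. The sketch below is an illustration under stated assumptions: it uses exact rational arithmetic, it assumes 0 < p < 1 (the strict inequalities do not hold in the degenerate cases p = 0 or p = 1), and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_term(x, n, p):
    """f_X(x; n, p) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def terms_behave_as_in_theorem_4(n, p):
    """True if consecutive terms rise before (n + 1)p, fall after it, and tie at it."""
    turning_point = (n + 1) * p
    for x in range(1, n + 1):
        previous, current = binomial_term(x - 1, n, p), binomial_term(x, n, p)
        if x < turning_point and not previous < current:
            return False
        if x > turning_point and not previous > current:
            return False
        if x == turning_point and previous != current:
            return False
    return True


# (n + 1)p = 5 is an integer here, so the terms at x = 4 and x = 5 should be equal.
print(terms_behave_as_in_theorem_4(9, Fraction(1, 2)))
print(binomial_term(4, 9, Fraction(1, 2)) == binomial_term(5, 9, Fraction(1, 2)))
```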
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2.3 Hypergeometric Distribution 

Definition 4 Hypergeometric distribution A random variable X is
defined to have a hypergeometric distribution if the discrete density function
of X is given by

$$f_X(x; M, K, n) = \begin{cases} \dfrac{\dbinom{K}{x} \dbinom{M - K}{n - x}}{\dbinom{M}{n}} & \text{for } x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where M is a positive integer, K is a nonnegative integer that is at most M,
and n is a positive integer that is at most M. Any distribution function
defined by the density function given in Eq. (7) above is called a
hypergeometric distribution. ////


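Theorem 5 If X has a hypergeometric distribution, then

$$\mathscr{E}[X] = n \cdot \frac{K}{M} \quad \text{and} \quad \text{var}[X] = n \cdot \frac{K}{M} \cdot \frac{M - K}{M} \cdot \frac{M - n}{M - 1}. \qquad (8)$$

PROOF

$$\mathscr{E}[X] = \sum_{x=0}^{n} x\,\frac{\binom{K}{x}\binom{M - K}{n - x}}{\binom{M}{n}}
= n \cdot \frac{K}{M} \sum_{x=1}^{n} \frac{\binom{K - 1}{x - 1}\binom{M - K}{n - x}}{\binom{M - 1}{n - 1}}
= n \cdot \frac{K}{M} \sum_{y=0}^{n - 1} \frac{\binom{K - 1}{y}\binom{M - 1 - K + 1}{n - 1 - y}}{\binom{M - 1}{n - 1}}
= n \cdot \frac{K}{M},$$

using

$$\sum_{i=0}^{m} \binom{a}{i}\binom{b}{m - i} = \binom{a + b}{m}$$

given in Appendix A.

FIGURE 4
Hypergeometric densities (panels: M = 10, K = 4, n = 4; M = 10, K = 4, n = 5).

$$\mathscr{E}[X(X - 1)] = \sum_{x=0}^{n} x(x - 1)\,\frac{\binom{K}{x}\binom{M - K}{n - x}}{\binom{M}{n}}
= n(n - 1)\,\frac{K(K - 1)}{M(M - 1)} \sum_{x=2}^{n} \frac{\binom{K - 2}{x - 2}\binom{M - K}{n - x}}{\binom{M - 2}{n - 2}}$$

$$= n(n - 1)\,\frac{K(K - 1)}{M(M - 1)} \sum_{y=0}^{n - 2} \frac{\binom{K - 2}{y}\binom{M - 2 - K + 2}{n - 2 - y}}{\binom{M - 2}{n - 2}}
= n(n - 1)\,\frac{K(K - 1)}{M(M - 1)}.$$

Hence

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \mathscr{E}[X(X - 1)] + \mathscr{E}[X] - (\mathscr{E}[X])^2$$

$$= n(n - 1)\,\frac{K(K - 1)}{M(M - 1)} + n\,\frac{K}{M} - n^2\,\frac{K^2}{M^2}
= n\,\frac{K}{M}\left[(n - 1)\,\frac{K - 1}{M - 1} + 1 - \frac{nK}{M}\right]
= \frac{nK}{M}\left[\frac{(M - K)(M - n)}{M(M - 1)}\right].$$

////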


Remark If we set K/M = p, then the mean of the hypergeometric
distribution coincides with the mean of the binomial distribution, and the
variance of the hypergeometric distribution is (M − n)/(M − 1) times the
variance of the binomial distribution. ////
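
The comparison with the binomial distribution can be checked for a particular case. The sketch below is illustrative: it takes M = 10, K = 4, n = 5 (the parameters of one panel of Fig. 4), computes the hypergeometric mean and variance by direct summation with exact rational arithmetic, and compares them with np and npq(M − n)/(M − 1). The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeometric_density(x, M, K, n):
    """C(K, x) C(M - K, n - x) / C(M, n); math.comb returns 0 when the lower index is too large."""
    return Fraction(comb(K, x) * comb(M - K, n - x), comb(M, n))


def hypergeometric_mean_variance(M, K, n):
    """Mean and variance by direct summation over x = 0, 1, ..., n."""
    density = [hypergeometric_density(x, M, K, n) for x in range(n + 1)]
    mean = sum(x * fx for x, fx in enumerate(density))
    second_moment = sum(x * x * fx for x, fx in enumerate(density))
    return sum(density), mean, second_moment - mean**2


M, K, n = 10, 4, 5
total, mean, variance = hypergeometric_mean_variance(M, K, n)
p = Fraction(K, M)
binomial_variance = n * p * (1 - p)
print(total, mean, n * p)
print(variance, binomial_variance * Fraction(M - n, M - 1))
```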





EXAMPLE 5 Let X denote the number of defectives in a sample of size n
when sampling is done without replacement from an urn containing M
balls, K of which are defective. Then X has a hypergeometric distribution.
See Eq. (5) of Subsec. 3.5 in Chap. I. ////


2.4 Poisson Distribution 

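Definition 5 Poisson distribution A random variable X is defined to
have a Poisson distribution if the density of X is given by

$$f_X(x) = f_X(x; \lambda) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}
       = \frac{e^{-\lambda} \lambda^x}{x!}\, I_{\{0, 1, \ldots\}}(x), \qquad (9)$$

where the parameter $\lambda$ satisfies $\lambda > 0$. The density given in Eq. (9) is
called a Poisson density. ////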

Theorem 6 Let X be a Poisson distributed random variable; then

$$\mathscr{E}[X] = \lambda, \quad \text{var}[X] = \lambda, \quad \text{and} \quad m_X(t) = e^{\lambda(e^t - 1)}. \qquad (10)$$

FIGURE 5
Poisson densities.
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PROOF 


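$$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{\infty} \frac{e^{tx} e^{-\lambda} \lambda^x}{x!}
= e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x}{x!} = e^{-\lambda} e^{\lambda e^t};$$

hence,

$$m'_X(t) = \lambda e^{-\lambda} e^t e^{\lambda e^t}$$

and

$$m''_X(t) = \lambda e^{-\lambda} e^t e^{\lambda e^t} [\lambda e^t + 1].$$

So,

$$\mathscr{E}[X] = m'_X(0) = \lambda$$

and

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m''_X(0) - \lambda^2 = \lambda[\lambda + 1] - \lambda^2 = \lambda.$$

////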
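
The three results of Theorem 6 can be checked numerically by truncating the infinite sums. The sketch below is only an approximation: the values $\lambda = 2.5$ and $t = 0.5$ and the truncation at 100 terms are arbitrary assumptions, chosen so that the neglected tail is negligible in floating point; the function names are hypothetical.

```python
import math


def poisson_density(x, lam):
    """f_X(x; lam) = exp(-lam) lam^x / x! from Eq. (9)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def poisson_summary(lam, t, terms=100):
    """Truncated-sum approximations to the mean, the variance, and m_X(t)."""
    values = range(terms)
    mean = sum(x * poisson_density(x, lam) for x in values)
    second_moment = sum(x * x * poisson_density(x, lam) for x in values)
    mgf = sum(math.exp(t * x) * poisson_density(x, lam) for x in values)
    return mean, second_moment - mean**2, mgf


lam, t = 2.5, 0.5  # illustrative values (assumptions)
mean, variance, mgf = poisson_summary(lam, t)
print(mean, variance)                           # both should be close to lam
print(mgf, math.exp(lam * (math.exp(t) - 1)))   # compare with Eq. (10)
```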


The Poisson distribution provides a realistic model for many random 
phenomena. Since the values of a Poisson random variable are the nonnega¬ 
tive integers, any random phenomenon for which a count of some sort is of 
interest is a candidate for modeling by assuming a Poisson distribution. Such 
a count might be the number of fatal traffic accidents per week in a given state, 
the number of radioactive particle emissions per unit of time, the number of 
telephone calls per hour coming into the switchboard of a large business, the 
number of meteorites that collide with a test satellite during a single orbit, 
the number of organisms per unit volume of some fluid, the number of defects 
per unit of some material, the number of flaws per unit length of some wire, 
etc. Naturally, not all counts can be realistically modeled with a Poisson dis¬ 
tribution, but some can; in fact, if certain assumptions regarding the phenomenon 
under observation are satisfied, the Poisson model is the correct model. 

Let us assume now that we are observing the occurrence of certain happen¬ 
ings in time, space, region, or length. A happening might be a fatal traffic 
accident, a particle emission, the arrival of a telephone call, a meteorite col¬ 
lision, a defect in an area of material, a flaw in a length of wire, etc. We will 
talk as though the happenings are occurring in time; although happenings 
occurring in space or length are appropriate as well. The occurrences of the 
happening in time could be sketched as in Fig. 6. An occurrence of a happen¬ 
ing is represented by x ; the sketch indicates that seven happenings occurred 
between time 0 and time t,. Assume now that there exists a positive quantity, 
say v, which satisfies the following: 



FIGURE 6


(i) The probability that exactly one happening will occur in a small
time interval of length h is approximately equal to vh, or P[one happening
in interval of length h] = vh + o(h).

(ii) The probability of more than one happening in a small time interval
of length h is negligible when compared to the probability of just one
happening in the same time interval, or P[two or more happenings in
interval of length h] = o(h).

(iii) The numbers of happenings in nonoverlapping time intervals are
independent.

The term o(h), which is read "some function of smaller order than h,"
denotes an unspecified function which satisfies

$$\lim_{h \to 0} \frac{o(h)}{h} = 0.$$

The quantity v can be interpreted as the mean rate at which happenings occur per 
unit of time and is consequently referred to as the mean rate of occurrence. 

Theorem 7 If the above three assumptions are satisfied, the number of
occurrences of a happening in a period of time of length t has a Poisson
distribution with parameter $\lambda = vt$. Or if the random variable Z(t)
denotes the number of occurrences of the happening in a time interval
of length t, then $P[Z(t) = z] = e^{-vt}(vt)^z/z!$ for z = 0, 1, 2, ....

We will outline two different proofs, neither of which is mathemati¬ 
cally rigorous. 


PROOF For convenience, let t be a point in time after time 0; so the
time interval (0, t] has length t, and the time interval (t, t + h] has length
h. Let $P_n(s) = P[Z(s) = n] = P$[exactly n happenings in an interval of
length s]; then

$P_0(t + h)$ = P[no happenings in interval (0, t + h]]

= P[no happenings in (0, t] and no happenings in (t, t + h]]

= P[no happenings in (0, t]] P[no happenings in (t, t + h]]

= $P_0(t) P_0(h)$,

using (iii), the independence assumption.
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Now P[no happenings in (t, t + h]] = 1 — /’[one or more happenings 
in (t, t + h]\ = 1 - /’[one happening in (t, t+h]] — /’[more than one 
happening in ( t, t + h]] = 1 - vh - o(h ) - o(h); so P 0 (t + h) = P 0 (t) 
[1 — vh — o(h ) — o(h)], or 

P 0 (t + h) - P 0 (t) o(h) + o(h) 

- -= - vP 0 (t) - P 0 (t) ---, 

and on passing to the limit one obtains the differential equation P' 0 (t) = 

— vP 0 (t), whose solution is P 0 (t) = e~''\ using the condition / > o (0) = 1. 
Similarly, P x (t + h) = P x (t)P 0 (h) + P 0 (t)P x (h), or P { (t + h) = /’,(<)[1 - vh 

— o(h)] + P 0 (t)[vh + o(h)], which gives the differential equation P\(t) = 

— vP,(t) + vP 0 (t), the solution of which is given by /’,(<) = vte~ v ', using 
the initial condition / > ,(0)=0. Continuing in a similar fashion one 
obtains P' n (t ) = - vP n (t) + vP„_,(t), for n = 2, 3, — 

It is seen that this system of differential equations is satisfied by 
P n (t) = (vt) B e->!. 

The second proof can be had by dividing the interval (0, t) into, say,
n time subintervals, each of length h = t/n. The probability that k
happenings occur in the interval (0, t) is approximately equal to the
probability that exactly one happening has occurred in each of k of the n
subintervals that we divided the interval (0, t) into. Now the probability
of a happening, or "success," in a given subinterval is vh. Each
subinterval provides us with a Bernoulli trial; either the subinterval has a
happening, or it does not. Also, in view of the assumptions made, these
Bernoulli trials are independent, repeated Bernoulli trials; hence the
probability of exactly k "successes" in the n trials is given by (see
Example 3)

$$\binom{n}{k}(vh)^k(1 - vh)^{n-k} = \binom{n}{k}\left[\frac{vt}{n}\right]^k\left[1 - \frac{vt}{n}\right]^{n-k},$$

which is an approximation to the desired probability that k happenings will
occur in time interval (0, t). An exact expression can be obtained by
letting the number of subintervals increase to infinity, that is, by letting n
tend to infinity:

$$\binom{n}{k}\left[\frac{vt}{n}\right]^k\left[1 - \frac{vt}{n}\right]^{n-k}
= \frac{(vt)^k}{k!}\,\frac{(n)_k}{n^k}\left[1 - \frac{vt}{n}\right]^{n}\left[1 - \frac{vt}{n}\right]^{-k}
\to \frac{(vt)^k e^{-vt}}{k!}$$

as $n \to \infty$, using the fact that

$$\left[1 - \frac{vt}{n}\right]^{n} \to e^{-vt}, \qquad \left[1 - \frac{vt}{n}\right]^{-k} \to 1, \qquad \text{and} \qquad \frac{(n)_k}{n^k} \to 1.$$

////
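The limiting argument of the second proof can be watched numerically. The following is a minimal sketch, not part of the original text; the values of v, t, and k are hypothetical, chosen only for illustration. It compares the binomial approximation with n subintervals against the Poisson probability.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

v, t, k = 0.5, 5.0, 3          # hypothetical rate, interval length, and count
exact = poisson_pmf(k, v * t)
errors = []
for n in (10, 100, 1000, 10000):
    approx = binomial_pmf(k, n, v * t / n)   # success probability vh with h = t/n
    errors.append(abs(approx - exact))
    print(f"n = {n:>5}: binomial = {approx:.6f}, Poisson = {exact:.6f}")
```

The discrepancy shrinks roughly in proportion to 1/n, which is what the three limits used in the proof suggest.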
Theorem 7 gives conditions under which certain random experiments
involving counts of happenings in time (or length, space, area, volume, etc.) can
be realistically modeled by assuming a Poisson distribution. The parameter v 
in the Poisson distribution is usually unknown. Techniques for estimating 
parameters such as v will be presented in Chap. VII. 

In practice great care has to be taken to avoid erroneously applying the 
Poisson distribution to counts. For example, in studying the distribution of 
insect larvae over some crop area, the Poisson model is apt to be invalid since
insects lay eggs in clusters, so that larvae are likely to be found in clusters,
which is inconsistent with the assumption of independence of counts in small
adjacent subareas.


EXAMPLE 6 Suppose that the average number of telephone calls arriving
at the switchboard of a small corporation is 30 calls per hour. (i) What
is the probability that no calls will arrive in a 3-minute period? (ii) What
is the probability that more than five calls will arrive in a 5-minute interval?
Assume that the number of calls arriving during any time period has a
Poisson distribution. Assume that time is measured in minutes; then 30
calls per hour is equivalent to .5 calls per minute, so the mean rate of
occurrence is .5 per minute.

$$P[\text{no calls in 3-minute period}] = e^{-vt} = e^{-(.5)(3)} = e^{-1.5} \approx .223.$$

$$P[\text{more than five calls in 5-minute interval}] = \sum_{k=6}^{\infty} \frac{e^{-vt}(vt)^k}{k!}
= \sum_{k=6}^{\infty} \frac{e^{-(.5)(5)}(2.5)^k}{k!} \approx .042.$$

////
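The two probabilities of Example 6 can be checked directly. This is a small sketch added for illustration; the helper name `poisson_pmf` is our own, and the infinite sum is evaluated through its complement.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

rate = 30 / 60                       # mean rate v, in calls per minute
p_none = poisson_pmf(0, rate * 3)    # no calls in a 3-minute period
# P[more than five] = 1 - P[five or fewer], avoiding the infinite sum
p_more_than_five = 1 - sum(poisson_pmf(k, rate * 5) for k in range(6))
print(round(p_none, 3), round(p_more_than_five, 3))
```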


EXAMPLE 7 A merchant knows that the number of a certain kind of item
that he can sell in a given period of time is Poisson distributed. How
many such items should the merchant stock so that the probability will be
.95 that he will have enough items to meet the customer demand for a time
period of length T? Let v denote the mean rate of occurrence per unit
time and K the unknown number of items that the merchant should stock.
Let X denote the number of demands for this kind of item during the time
period of length T. The solution requires finding K so that $P[X \le K] \ge .95$
or finding K so that $\sum_{k=0}^{K} e^{-vT}(vT)^k/k! \ge .95$. In particular, if the
merchant sells an average of two such items per day, how many should
he stock so that he will have probability at least .95 of having enough
items to meet demand for a 30-day month? Find K so that

$$\sum_{k=0}^{K} \frac{e^{-(2)(30)}\,60^k}{k!} \ge .95,$$

or find K so that

$$\sum_{k=K+1}^{\infty} \frac{e^{-60}\,60^k}{k!} \le .05.$$

The desired K can be found using an appropriate Poisson table (e.g.,
Molina, 1942 [45]). It is K = 73. ////
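In place of a Poisson table, the search for K can be carried out by accumulating the Poisson probabilities until the required coverage is reached. The sketch below is an illustration only; the function name `smallest_stock` is hypothetical.

```python
import math

def smallest_stock(mean_demand, coverage):
    """Smallest K with P[X <= K] >= coverage for X ~ Poisson(mean_demand)."""
    term = math.exp(-mean_demand)    # P[X = 0]
    cdf = term
    k = 0
    while cdf < coverage:
        k += 1
        term *= mean_demand / k      # P[X = k] obtained from P[X = k - 1]
        cdf += term
    return k

K = smallest_stock(2 * 30, 0.95)     # two items per day over a 30-day month
print(K)
```

Building each term from the previous one avoids computing large powers and factorials separately.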


EXAMPLE 8 Suppose that flaws in plywood occur at random with an average
of one flaw per 50 square feet. What is the probability that a 4 foot
× 8 foot sheet will have no flaws? At most one flaw? To get a solution
assume that the number of flaws per unit area is Poisson distributed.

$$P[\text{no flaws}] = e^{-(1/50)(32)} = e^{-.64} \approx .527.$$

$$P[\text{at most one flaw}] = e^{-.64} + .64e^{-.64} \approx .865.$$

////

A Poisson density function, like the binomial density, possesses a certain 
monotonicity that is precisely stated in the following theorem. 


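Theorem 8 Consider the Poisson density

$$\frac{e^{-\lambda}\lambda^k}{k!} \qquad \text{for } k = 0, 1, 2, \ldots.$$

Then

$$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} < \frac{e^{-\lambda}\lambda^k}{k!} \qquad \text{for } k < \lambda,$$

$$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} > \frac{e^{-\lambda}\lambda^k}{k!} \qquad \text{for } k > \lambda,$$

and

$$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} = \frac{e^{-\lambda}\lambda^k}{k!} \qquad \text{if } \lambda \text{ is an integer and } k = \lambda.$$

PROOF

$$\frac{e^{-\lambda}\lambda^{k-1}/(k-1)!}{e^{-\lambda}\lambda^k/k!} = \frac{k}{\lambda},$$

which is less than 1 if $k < \lambda$, greater than 1 if $k > \lambda$, and equal to 1 if $\lambda$
is an integer and $k = \lambda$. ////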
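One consequence of Theorem 8 is that the mass points of largest probability can be read off from the parameter alone. The following sketch is our own illustration (the function name `poisson_modes` is hypothetical); it lists those mass points and prints the corresponding density values.

```python
import math

def poisson_pmf(k, lam):
    """Poisson density e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_modes(lam):
    """Mass points of maximum density, as implied by Theorem 8."""
    if lam >= 1 and lam == math.floor(lam):
        # the density rises up to lam - 1, is level at lam, then falls
        return [int(lam) - 1, int(lam)]
    # otherwise the density rises while k < lam and falls once k > lam
    return [math.floor(lam)]

for lam in (0.5, 2.5, 3.0):          # hypothetical parameter values
    modes = poisson_modes(lam)
    print(lam, modes, [round(poisson_pmf(k, lam), 4) for k in modes])
```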
2.5 Geometric and Negative Binomial Distributions

Two other families of discrete distributions that play important roles in statistics 
are the geometric (or Pascal) and negative binomial distributions. The reason 
that we consider the two together is twofold; first, the geometric distribution is a 
special case of the negative binomial distribution, and, second, the sum of 
independent and identically distributed geometric random variables is negative 
binomially distributed, as we shall see in Chap. V. In Subsec. 3.3 of this chapter, 
the exponential and gamma distributions are defined. We shall see that in 
several respects the geometric and negative binomial distributions are discrete 
analogs of the exponential and gamma distributions. 

Definition 6 Geometric distribution A random variable X is defined
to have a geometric (or Pascal) distribution if the density of X is given by

$$f_X(x) = f_X(x; p) = \begin{cases} p(1 - p)^x & \text{for } x = 0, 1, \ldots \\ 0 & \text{otherwise} \end{cases}
= p(1 - p)^x I_{\{0, 1, \ldots\}}(x), \tag{11}$$

where the parameter p satisfies $0 < p \le 1$. (Define q = 1 - p.) ////


Definition 7 Negative binomial distribution A random variable X
with density

$$f_X(x) = f_X(x; r, p) = \binom{r + x - 1}{x} p^r q^x I_{\{0, 1, \ldots\}}(x)
= \binom{-r}{x} p^r (-q)^x I_{\{0, 1, \ldots\}}(x), \tag{12}$$

where the parameters r and p satisfy r = 1, 2, 3, ... and $0 < p \le 1$
(q = 1 - p), is defined to have a negative binomial distribution. The
density given by Eq. (12) is called a negative binomial density.

////


Remark If in the negative binomial distribution r = 1, then the negative 
binomial density specializes to the geometric density. //// 
Theorem 9 If the random variable X has a geometric distribution,
then

$$\mathscr{E}[X] = \frac{q}{p}, \qquad \text{var}[X] = \frac{q}{p^2}, \qquad \text{and} \qquad m_X(t) = \frac{p}{1 - qe^t}. \tag{13}$$

PROOF Since a geometric distribution is a special case of a negative
binomial distribution, Theorem 9 is a corollary of Theorem 11. ////

The geometric distribution is well named since the values that the geometric 
density assumes are the terms of a geometric series. Also the mode of the 
geometric density is necessarily 0. A geometric density possesses one other 
interesting property, which is given in the following theorem. 


Theorem 10 If X has the geometric density with parameter p, then
$P[X \ge i + j \mid X \ge i] = P[X \ge j]$ for i, j = 0, 1, 2, ....

PROOF

$$P[X \ge i + j \mid X \ge i] = \frac{P[X \ge i + j]}{P[X \ge i]}
= \frac{\sum_{x=i+j}^{\infty} p(1 - p)^x}{\sum_{x=i}^{\infty} p(1 - p)^x}
= \frac{(1 - p)^{i+j}}{(1 - p)^i} = (1 - p)^j = P[X \ge j].$$

////


Theorem 10 says that the probability that a geometric random variable is 
greater than or equal to i + j given that it is greater than or equal to i is equal to 
the unconditional probability that it will be greater than or equal to j. We will 
comment on this again in the following example. 


EXAMPLE 9 Consider a sequence of independent, repeated Bernoulli trials
with p equal to the probability of success on an individual trial. Let the
random variable X represent the number of trials required before the first
success; then X has the geometric density given by Eq. (11). To see this,
note that the first success will occur on trial x + 1 if this (x + 1)st trial
results in a success and the first x trials resulted in failures; but, by
independence, x successive failures followed by a success has probability
$(1 - p)^x p$. In the language of this example, Theorem 10 states that the
probability that at least i + j trials are required before the first success,
given that there have been i successive failures, is equal to the
unconditional probability that at least j trials are needed before the first
success. That is, the fact that one has already observed i successive failures
does not change the distribution of the number of trials required to obtain
the first success. ////
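The property in Theorem 10 can be confirmed by summing the geometric density directly. This is a minimal sketch for illustration; the values of p, i, and j are hypothetical, and the infinite sums are truncated after enough terms that the remainder is negligible.

```python
def geometric_tail(j, p):
    """P[X >= j] for the geometric density p(1-p)^x on x = 0, 1, 2, ..."""
    return (1 - p) ** j

def conditional_tail(i, j, p, terms=500):
    """P[X >= i+j | X >= i], computed by summing the density term by term."""
    numerator = sum(p * (1 - p) ** x for x in range(i + j, i + j + terms))
    denominator = sum(p * (1 - p) ** x for x in range(i, i + terms))
    return numerator / denominator

p, i, j = 0.3, 4, 2                  # hypothetical values
print(conditional_tail(i, j, p), geometric_tail(j, p))
```

Changing i leaves the first printed value unchanged, which is the statement that past failures carry no information about the remaining wait.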


A random variable X that has a geometric distribution is often referred
to as a discrete waiting-time random variable. It represents how long (in terms
of the number of failures) one has to wait for a success.

Before leaving the geometric distribution, we note that some authors
define the geometric distribution by assuming 1 (instead of 0) is the smallest
mass point. The density then has the form

$$f(x; p) = p(1 - p)^{x-1} I_{\{1, 2, \ldots\}}(x), \tag{14}$$

and the mean is $1/p$, the variance is $q/p^2$, and the moment generating function is
$pe^t/(1 - qe^t)$.


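Theorem 11 Let X have a negative binomial distribution; then

$$\mathscr{E}[X] = \frac{rq}{p}, \qquad \text{var}[X] = \frac{rq}{p^2}, \qquad \text{and} \qquad m_X(t) = \left[\frac{p}{1 - qe^t}\right]^r. \tag{15}$$

PROOF

$$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{\infty} e^{tx}\binom{-r}{x} p^r (-q)^x
= \sum_{x=0}^{\infty} \binom{-r}{x} p^r (-qe^t)^x = \left[\frac{p}{1 - qe^t}\right]^r$$

[see Eq. (33) in Appendix A].

$$m_X'(t) = p^r(-r)(1 - qe^t)^{-r-1}(-qe^t)$$

and

$$m_X''(t) = rqp^r\left[q(r + 1)e^{2t}(1 - qe^t)^{-r-2} + e^t(1 - qe^t)^{-r-1}\right];$$

hence

$$\mathscr{E}[X] = m_X'(t)\Big|_{t=0} = \frac{rq}{p}$$

and

$$\begin{aligned}
\text{var}[X] &= m_X''(t)\Big|_{t=0} - (\mathscr{E}[X])^2
= rqp^r\left[qp^{-r-2}(r + 1) + p^{-r-1}\right] - \left(\frac{rq}{p}\right)^2 \\
&= \frac{rq^2}{p^2} + \frac{rq}{p} = \frac{rq}{p^2}.
\end{aligned}$$

////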


The negative binomial distribution, like the Poisson, has the nonnegative
integers for its mass points; hence, the negative binomial distribution is
potentially a model for a random experiment where a count of some sort is of interest.
Indeed, the negative binomial distribution has been applied in population counts, 
in health and accident statistics, in communications, and in other counts as 
well. Unlike the Poisson distribution, where the mean and variance are the 
same, the variance of the negative binomial distribution is greater than its mean. 
We will see in Subsec. 4.3 of this chapter that the negative binomial distribution 
can be obtained as a contagious distribution from the Poisson distribution. 
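The mean and variance in Theorem 11, and the remark that the variance exceeds the mean, can be checked by summing the density of Eq. (12) directly. This is a sketch under assumed parameter values (r = 3, p = .4, chosen only for illustration), with the infinite sums truncated once the terms are negligible.

```python
import math

def neg_binomial_pmf(x, r, p):
    """P[X = x]: probability of x failures before the r-th success."""
    return math.comb(r + x - 1, x) * p**r * (1 - p)**x

def moments(r, p, terms=500):
    """Mean and variance by direct summation over the first `terms` mass points."""
    mean = sum(x * neg_binomial_pmf(x, r, p) for x in range(terms))
    second = sum(x * x * neg_binomial_pmf(x, r, p) for x in range(terms))
    return mean, second - mean**2

r, p = 3, 0.4                        # hypothetical parameters
q = 1 - p
mean, var = moments(r, p)
print(mean, r * q / p)               # summed mean against rq/p
print(var, r * q / p**2)             # summed variance against rq/p^2
```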
EXAMPLE 10 Consider a sequence of independent, repeated Bernoulli
trials with p equal to the probability of success on an individual trial. Let
the random variable X represent the number of failures prior to the rth
success; then X has the negative binomial density given by Eq. (12), as the
following argument shows: The last trial must result in a success, having
probability p; among the first x + r - 1 trials there must be r - 1 successes
and x failures, and the probability of this is

$$\binom{x + r - 1}{r - 1} p^{r-1} q^x,$$

which when multiplied by p gives the desired result.

////


A random variable X having a negative binomial distribution is often 
referred to as a discrete waiting-time random variable. It represents how long 
(in terms of the number of failures) one waits for the rth success. 


EXAMPLE 11 The negative binomial distribution is of importance in the
consideration of inverse binomial sampling. Suppose a proportion p
of individuals in a population possesses a certain characteristic. If
individuals in the population are sampled until exactly r individuals with
the certain characteristic are found, then the number of individuals in
excess of r that are observed or sampled has a negative binomial
distribution. ////


2.6 Other Discrete Distributions 


In the previous five subsections we presented seven parametric families of
univariate discrete density functions. Each is commonly known by the names
given. There are many other families of discrete density functions. In fact,
new families can be formed from the presented families by various processes.
One such process is called truncation. We will illustrate this process by looking
at the Poisson distribution truncated at 0. Suppose, as is sometimes the case,
that the zero count cannot be observed yet the Poisson distribution seems a
reasonable model. One might then distribute the mass ordinarily given to the
mass point 0 proportionately among the other mass points, obtaining the family
of densities

$$f(x; \lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!\,(1 - e^{-\lambda})} & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

A random variable having density given by Eq. (16) is called a Poisson
random variable truncated at 0.
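A quick check shows that redistributing the mass of the point 0 proportionately does yield a density. The sketch below is illustrative only; the parameter value is hypothetical. It verifies that the truncated probabilities sum to 1 and that each is the ordinary Poisson probability scaled by the same constant.

```python
import math

def poisson_pmf(x, lam):
    """Ordinary Poisson density."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def truncated_poisson_pmf(x, lam):
    """Poisson density truncated at 0: mass only on x = 1, 2, ..."""
    if x < 1:
        return 0.0
    return poisson_pmf(x, lam) / (1 - math.exp(-lam))

lam = 2.0                            # hypothetical parameter
total = sum(truncated_poisson_pmf(x, lam) for x in range(1, 100))
ratios = [truncated_poisson_pmf(x, lam) / poisson_pmf(x, lam) for x in range(1, 10)]
print(total, ratios[0])
```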

Another process for obtaining a new family of densities from a given
family can also be illustrated with the Poisson distribution. Suppose that a
random variable X, representing a count of some sort, has a Poisson
distribution. If the experimenter is stuck with a rather poor counter, one that cannot
count beyond 2, the random variable that the experimenter actually observes
has density given by

    z          0                1                        2
    f_Z(z)     $e^{-\lambda}$   $\lambda e^{-\lambda}$   $1 - e^{-\lambda} - \lambda e^{-\lambda}$

The counter counts correctly values 0 and 1 of the random variable X; but if X 
takes on any value 2 or more, the counter counts 2. Such a random variable 
is often referred to as a censored random variable. 

The above two illustrations indicate how other families of discrete densities 
can be formulated from existing families. We close this section by giving two 
further, not so well-known, families of discrete densities. 


Definition 8 Beta-binomial distribution The distribution with discrete
density function

$$f(x) = f(x; n, \alpha, \beta) = \binom{n}{x}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
\frac{\Gamma(x + \alpha)\Gamma(n + \beta - x)}{\Gamma(n + \alpha + \beta)}\, I_{\{0, 1, \ldots, n\}}(x), \tag{17}$$

where n is a nonnegative integer, $\alpha > 0$, and $\beta > 0$, is defined as the
beta-binomial distribution.

$\Gamma(m)$ is the well-known gamma function $\Gamma(m) = \int_0^{\infty} x^{m-1}e^{-x}\,dx$ for
m > 0. See Appendix A. The beta-binomial distribution has

$$\text{Mean} = \frac{n\alpha}{\alpha + \beta} \qquad \text{and} \qquad
\text{variance} = \frac{n\alpha\beta(n + \alpha + \beta)}{(\alpha + \beta)^2(\alpha + \beta + 1)}. \tag{18}$$

It has the same mass points as the binomial distribution. If $\alpha = \beta = 1$,
then the beta-binomial distribution reduces to a discrete uniform
distribution over the integers 0, 1, ..., n. ////
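The density of Eq. (17) and the moments of Eq. (18) can be evaluated numerically. The following is a sketch with hypothetical parameter values; the log-gamma function is used only to keep the gamma ratios well scaled.

```python
import math

def beta_binomial_pmf(x, n, a, b):
    """Beta-binomial density of Eq. (17), with a = alpha and b = beta."""
    log_gammas = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + math.lgamma(x + a) + math.lgamma(n + b - x)
                  - math.lgamma(n + a + b))
    return math.comb(n, x) * math.exp(log_gammas)

n, a, b = 10, 2.0, 3.0               # hypothetical parameters
pmf = [beta_binomial_pmf(x, n, a, b) for x in range(n + 1)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum(x * x * f for x, f in enumerate(pmf)) - mean**2
print(mean, n * a / (a + b))
print(var, n * a * b * (n + a + b) / ((a + b)**2 * (a + b + 1)))

# With alpha = beta = 1 every mass point 0, 1, ..., n receives mass 1/(n + 1).
uniform_case = [beta_binomial_pmf(x, 5, 1.0, 1.0) for x in range(6)]
```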
Definition 9 Logarithmic distribution The distribution with discrete
density function

$$f(x; p) = \begin{cases} \dfrac{q^x}{-x\log_e p} & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}
= \frac{q^x}{-x\log_e p}\, I_{\{1, 2, \ldots\}}(x), \tag{19}$$

where the parameters satisfy 0 < p < 1 and q = 1 - p, is defined as the
logarithmic distribution. ////

The name is justified if one recalls the power-series expansion of $\log_e(1 - q)$.
The logarithmic distribution has

$$\text{Mean} = \frac{q}{-p\log_e p} \qquad \text{and} \qquad
\text{variance} = \frac{q(q + \log_e p)}{-(p\log_e p)^2}. \tag{20}$$


It can be derived as a limiting distribution of negative binomial distributions 
that have been generalized to include r, any positive number (rather than just 
an integer), truncated at 0. The limiting distribution is obtained by letting r 
approach 0. 
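The role of the power series for $\log_e(1 - q)$ is easy to see numerically: the terms $q^x/x$ sum to $-\log_e p$, so the density of Eq. (19) sums to 1. The sketch below, with a hypothetical value of p, also compares the summed moments with Eq. (20).

```python
import math

def logarithmic_pmf(x, p):
    """Logarithmic density q^x / (-x log p) on x = 1, 2, ..."""
    q = 1 - p
    return q**x / (-x * math.log(p))

p = 0.4                              # hypothetical parameter
q = 1 - p
support = range(1, 500)              # terms beyond this are negligible for q = .6
total = sum(logarithmic_pmf(x, p) for x in support)
mean = sum(x * logarithmic_pmf(x, p) for x in support)
var = sum(x * x * logarithmic_pmf(x, p) for x in support) - mean**2
print(total)
print(mean, q / (-p * math.log(p)))
print(var, q * (q + math.log(p)) / (-(p * math.log(p))**2))
```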


3 CONTINUOUS DISTRIBUTIONS 

In this section several parametric families of univariate probability density 
functions are presented. Sketches of some are included; the mean and variance 
(when they exist) of each are given. 

3.1 Uniform or Rectangular Distribution 

A very simple distribution for a continuous random variable is the uniform
distribution. It is particularly useful in theoretical statistics because it is convenient
to deal with mathematically.

Definition 10 Uniform distribution If the probability density function
of a random variable X is given by

$$f_X(x) = f_X(x; a, b) = \frac{1}{b - a}\, I_{[a, b]}(x), \tag{21}$$

where the parameters a and b satisfy $-\infty < a < b < \infty$, then the random
variable X is defined to be uniformly distributed over the interval [a, b],
and the distribution given by Eq. (21) is called a uniform distribution.

////

FIGURE 8
Uniform probability density.


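Theorem 12 If X is uniformly distributed over [a, b], then

$$\mathscr{E}[X] = \frac{a + b}{2}, \qquad \text{var}[X] = \frac{(b - a)^2}{12}, \qquad \text{and} \qquad
m_X(t) = \frac{e^{bt} - e^{at}}{(b - a)t}. \tag{22}$$

PROOF

$$\mathscr{E}[X] = \int_a^b x\,\frac{1}{b - a}\,dx = \frac{b^2 - a^2}{2(b - a)} = \frac{a + b}{2}.$$

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \int_a^b x^2\,\frac{1}{b - a}\,dx - \left(\frac{a + b}{2}\right)^2
= \frac{b^3 - a^3}{3(b - a)} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12}.$$

$$m_X(t) = \mathscr{E}[e^{tX}] = \int_a^b e^{tx}\,\frac{1}{b - a}\,dx = \frac{e^{bt} - e^{at}}{(b - a)t}.$$

////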


The uniform distribution gets its name from the fact that its density is 
uniform, or constant, over the interval [a, b]. It is also called the rectangular 
distribution—the shape of the density is rectangular. 

The cumulative distribution function of a uniform random variable is
given by

$$F_X(x) = \left(\frac{x - a}{b - a}\right) I_{[a, b]}(x) + I_{(b, \infty)}(x). \tag{23}$$

It provides a useful model for a few random phenomena. For instance, if it
is known that the values of some random variable X can only be in a finite
interval, say [a, b], and if one assumes that any two subintervals of [a, b] of
equal length have the same probability of containing X, then X has a uniform
distribution over the interval [a, b]. When one speaks of a random number
from the interval [0, 1], one is thinking of the value of a uniformly distributed
random variable over the interval [0, 1].

EXAMPLE 12 If a wheel is spun and then allowed to come to rest, the point 
on the circumference of the wheel that is located opposite a certain fixed 
marker could be considered the value of a random variable X that is 
uniformly distributed over the circumference of the wheel. One could 
then compute the probability that X will fall in any given arc. //// 

Although we defined the uniform distribution as being uniformly
distributed over the closed interval [a, b], one could just as well define it over the
open interval (a, b) [in which case $f_X(x) = (b - a)^{-1} I_{(a, b)}(x)$] or over either of the
half-open-half-closed intervals (a, b] or [a, b). Note that all four of the possible
densities have the same cumulative distribution function. This lack of
uniqueness of probability density functions was first mentioned in Subsec. 3.2 of
Chap. II.


3.2 Normal Distribution 

A great many of the techniques used in applied statistics are based upon the 
normal distribution; it will frequently appear in the remainder of this book. 

Definition 11 Normal distribution A random variable X is defined to
be normally distributed if its density is given by

$$f_X(x) = f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/2\sigma^2}, \tag{24}$$

where the parameters $\mu$ and $\sigma$ satisfy $-\infty < \mu < \infty$ and $\sigma > 0$. Any
distribution defined by a density function given in Eq. (24) is called a
normal distribution. ////

We have used the symbols $\mu$ and $\sigma^2$ to represent the parameters because
these parameters turn out, as we shall see, to be the mean and variance,
respectively, of the distribution.
FIGURE 9
Normal densities.

One can readily check that the mode of a normal density occurs at $x = \mu$
and inflection points occur at $\mu - \sigma$ and $\mu + \sigma$. (See Fig. 9.) Since the normal
distribution occurs so frequently in later chapters, special notation is introduced
for it. If random variable X is normally distributed with mean $\mu$ and variance
$\sigma^2$, we will write $X \sim N(\mu, \sigma^2)$. We will also use the notation $\phi_{\mu,\sigma^2}(x)$ for the
density of $X \sim N(\mu, \sigma^2)$ and $\Phi_{\mu,\sigma^2}(x)$ for the cumulative distribution function.

If the normal random variable has mean 0 and variance 1, it is called a
standard or normalized normal random variable. For a standard normal
random variable the subscripts of the density and distribution function notations
are dropped; that is,

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \qquad \text{and} \qquad \Phi(x) = \int_{-\infty}^{x} \phi(u)\,du. \tag{25}$$

Since $\phi_{\mu,\sigma^2}(x)$ is given to be a density function, it is implied that

$$\int_{-\infty}^{\infty} \phi_{\mu,\sigma^2}(x)\,dx = 1,$$

but we should satisfy ourselves that this is true. The verification is somewhat
troublesome because the indefinite integral of this particular density function
does not have a simple functional expression. Suppose that we represent the
area under the curve by A; then

$$A = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx,$$

and on making the substitution $y = (x - \mu)/\sigma$, we find that

$$A = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2}\,dy.$$

FIGURE 10
Normal cumulative distribution function.

We wish to show that A = 1, and this is most easily done by showing that $A^2$ is
1 and then reasoning that A = 1 since $\phi_{\mu,\sigma^2}(x)$ is positive. We may put

$$\begin{aligned}
A^2 &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2}\,dy \cdot \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}z^2}\,dz \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(y^2 + z^2)}\,dy\,dz
\end{aligned}$$

by writing the product of two integrals as a double integral. In this integral
we change the variables to polar coordinates by the substitutions

$$y = r\sin\theta, \qquad z = r\cos\theta,$$

and the integral becomes

$$\begin{aligned}
A^2 &= \frac{1}{2\pi}\int_0^{\infty}\int_0^{2\pi} r e^{-\frac{1}{2}r^2}\,d\theta\,dr \\
&= \int_0^{\infty} r e^{-\frac{1}{2}r^2}\,dr \\
&= 1.
\end{aligned}$$
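Theorem 13 If X is a normal random variable,

$$\mathscr{E}[X] = \mu, \qquad \text{var}[X] = \sigma^2, \qquad \text{and} \qquad m_X(t) = e^{\mu t + \sigma^2 t^2/2}. \tag{26}$$

PROOF

$$\begin{aligned}
m_X(t) &= \mathscr{E}[e^{tX}] = e^{t\mu}\,\mathscr{E}[e^{t(X - \mu)}] \\
&= e^{t\mu}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{t(x - \mu)}\, e^{-(1/2\sigma^2)(x - \mu)^2}\,dx \\
&= e^{t\mu}\,\frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-(1/2\sigma^2)[(x - \mu)^2 - 2\sigma^2 t(x - \mu)]}\,dx.
\end{aligned}$$

If we complete the square inside the bracket, it becomes

$$\begin{aligned}
(x - \mu)^2 - 2\sigma^2 t(x - \mu) &= (x - \mu)^2 - 2\sigma^2 t(x - \mu) + \sigma^4 t^2 - \sigma^4 t^2 \\
&= (x - \mu - \sigma^2 t)^2 - \sigma^4 t^2,
\end{aligned}$$

and we have

$$m_X(t) = e^{t\mu}\, e^{\sigma^2 t^2/2}\,\frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-(x - \mu - \sigma^2 t)^2/2\sigma^2}\,dx.$$

The integral together with the factor $1/\sqrt{2\pi}\,\sigma$ is necessarily 1 since it is the
area under a normal distribution with mean $\mu + \sigma^2 t$ and variance $\sigma^2$.
Hence,

$$m_X(t) = e^{t\mu + \sigma^2 t^2/2}.$$

On differentiating $m_X(t)$ twice and substituting t = 0, we find

$$\mathscr{E}[X] = m_X'(0) = \mu$$

and

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m_X''(0) - \mu^2 = \sigma^2,$$

thus justifying our use of the symbols $\mu$ and $\sigma^2$ for the parameters. ////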


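Since the indefinite integral of $\phi_{\mu,\sigma^2}(x)$ does not have a simple functional
form, one can only exhibit the cumulative distribution function as

$$\Phi_{\mu,\sigma^2}(x) = \int_{-\infty}^{x} \phi_{\mu,\sigma^2}(u)\,du. \tag{27}$$

The following theorem shows that we can find the probability that a normally
distributed random variable, with mean $\mu$ and variance $\sigma^2$, falls in any interval
in terms of the standard normal cumulative distribution function, and this
standard normal cumulative distribution function is tabled in Table 2 of
Appendix D.

Theorem 14 If $X \sim N(\mu, \sigma^2)$, then

$$P[a < X < b] = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right). \tag{28}$$

PROOF

$$\begin{aligned}
P[a < X < b] &= \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}[(x - \mu)/\sigma]^2}\,dx \\
&= \int_{(a - \mu)/\sigma}^{(b - \mu)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz \\
&= \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right).
\end{aligned}$$

////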
Remark $\Phi(x) = 1 - \Phi(-x)$. ////

The normal distribution appears to be a reasonable model of the behavior
of certain random phenomena. It also is the limiting form of many other
probability distributions. Some such limits are given in Subsec. 4.1 of this chapter.
The normal distribution is also the limiting distribution in the famous
central-limit theorem, which is discussed in Sec. 4 of Chap. V and again in Sec. 3 of
Chap. VI.

Most students are already somewhat familiar with the normal distribution 
because of their experience with “grading on the curve.” This notion is 
covered in the following example. 

EXAMPLE 13 Suppose that an instructor assumes that a student's final
score is the value of a normally distributed random variable. If the
instructor decides to award a grade of A to those students whose score
exceeds $\mu + \sigma$, a B to those students whose score falls between $\mu$ and
$\mu + \sigma$, a C if a score falls between $\mu - \sigma$ and $\mu$, a D if a score falls between
$\mu - 2\sigma$ and $\mu - \sigma$, and an F if the score falls below $\mu - 2\sigma$, then the
proportions of each grade given can be calculated. For example, since

$$P[X > \mu + \sigma] = 1 - P[X \le \mu + \sigma] = 1 - \Phi\left(\frac{\mu + \sigma - \mu}{\sigma}\right) = 1 - \Phi(1) \approx .1587,$$

one would expect 15.87 percent of the students to receive A's. ////

EXAMPLE 14 Suppose that the diameters of shafts manufactured by a
certain machine are normal random variables with mean 10 centimeters and
standard deviation .1 centimeter. If for a given application the shaft must
meet the requirement that its diameter fall between 9.9 and 10.2
centimeters, what proportion of the shafts made by this machine will meet the
requirement?

$$\begin{aligned}
P[9.9 < X < 10.2] &= \Phi\left(\frac{10.2 - 10}{.1}\right) - \Phi\left(\frac{9.9 - 10}{.1}\right) \\
&= \Phi(2) - \Phi(-1) \approx .9772 - .1587 = .8185.
\end{aligned}$$

////
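When a table of the standard normal distribution is not at hand, Examples 13 and 14 can be reproduced with the error function, using the identity $\Phi(x) = \frac{1}{2}[1 + \text{erf}(x/\sqrt{2})]$. The sketch below is an illustration only; the function names are our own.

```python
import math

def std_normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_interval_prob(a, b, mu, sigma):
    """P[a < X < b] for X ~ N(mu, sigma^2), following Theorem 14."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

grade_a = 1 - std_normal_cdf(1)                        # Example 13
shafts_ok = normal_interval_prob(9.9, 10.2, 10, 0.1)   # Example 14
print(round(grade_a, 4), round(shafts_ok, 4))
```

The second value comes out near .8186 rather than .8185; the small difference arises because the worked example subtracts table entries that were already rounded to four places.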

3.3 Exponential and Gamma Distributions 

Two other families of distributions that play important roles in statistics are the
(negative) exponential and gamma distributions, which are defined in this
subsection. The reason that the two are considered together is twofold; first, the
exponential is a special case of the gamma, and, second, the sum of independent
identically distributed exponential random variables is gamma-distributed, as
we shall see in Chap. V.

Definition 12 Exponential distribution If a random variable X has a
density given by

$$f_X(x; \lambda) = \lambda e^{-\lambda x} I_{[0, \infty)}(x), \tag{29}$$

where $\lambda > 0$, then X is defined to have an (negative) exponential
distribution. ////

Definition 13 Gamma distribution If a random variable X has density
given by

$$f_X(x; r, \lambda) = \frac{\lambda}{\Gamma(r)}(\lambda x)^{r-1} e^{-\lambda x} I_{[0, \infty)}(x), \tag{30}$$

where r > 0 and $\lambda > 0$, then X is defined to have a gamma distribution.
$\Gamma(\cdot)$ is the gamma function and it is discussed in Appendix A. ////

Remark If in the gamma density r = 1, the gamma density specializes
to the exponential density. ////

Theorem 15 If X has an exponential distribution, then

$$\mathscr{E}[X] = \frac{1}{\lambda}, \qquad \text{var}[X] = \frac{1}{\lambda^2}, \qquad \text{and} \qquad m_X(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda. \tag{31}$$

PROOF The exponential distribution was the distribution used as an
example for some definitions given in Chap. II, and derivations of the above
appear there. Also, Theorem 15 is a corollary to the following theorem.

////

Theorem 16 If X has a gamma distribution with parameters r and $\lambda$,
then

$$\mathscr{E}[X] = \frac{r}{\lambda}, \qquad \text{var}[X] = \frac{r}{\lambda^2}, \qquad \text{and} \qquad m_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r \quad \text{for } t < \lambda. \tag{32}$$
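FIGURE 11
Gamma densities ($\lambda = 1$).

PROOF

$$\begin{aligned}
m_X(t) = \mathscr{E}[e^{tX}] &= \int_0^{\infty} \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x} e^{tx}\,dx \\
&= \left(\frac{\lambda}{\lambda - t}\right)^r \int_0^{\infty} \frac{(\lambda - t)^r}{\Gamma(r)}\, x^{r-1} e^{-(\lambda - t)x}\,dx
= \left(\frac{\lambda}{\lambda - t}\right)^r.
\end{aligned}$$

$$m_X'(t) = r\lambda^r(\lambda - t)^{-r-1}$$

and

$$m_X''(t) = r(r + 1)\lambda^r(\lambda - t)^{-r-2};$$

hence

$$\mathscr{E}[X] = m_X'(0) = \frac{r}{\lambda}$$

and

$$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m_X''(0) - \left(\frac{r}{\lambda}\right)^2
= \frac{r(r + 1)}{\lambda^2} - \left(\frac{r}{\lambda}\right)^2 = \frac{r}{\lambda^2}.$$

////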


The exponential distribution has been used as a model for lifetimes of
various things. When we introduced the Poisson distribution, we spoke of
certain happenings, for example, particle emissions, occurring in time. The length
of the time interval between successive happenings can be shown to have an
exponential distribution provided that the number of happenings in a fixed
time interval has a Poisson distribution. We comment on this again in Subsec.
4.2 below. Also, if we assume again that the number of happenings in a fixed
time interval is Poisson distributed, the length of time between time 0 and the
instant when the rth happening occurs can be shown to have a gamma
distribution. So a gamma random variable can be thought of as a continuous
waiting-time random variable. It is the time one has to wait for the rth happening.
Recall that the geometric and negative binomial random variables were
discrete waiting-time random variables. In a sense, they are discrete analogs of the
negative exponential and gamma distributions, respectively.


Theorem 17 If the random variable X has a gamma distribution with
parameters r and $\lambda$, where r is a positive integer, then

$$F_X(x) = 1 - \sum_{j=0}^{r-1} \frac{e^{-\lambda x}(\lambda x)^j}{j!}. \tag{33}$$

PROOF The proof can be obtained by successive integrations by
parts. ////

For $\lambda = 1$, $F_X(x)$ given in Eq. (33) is called the incomplete gamma function
and has been extensively tabulated.
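Equation (33) can be compared with a direct numerical integration of the gamma density of Eq. (30). This is a sketch under assumed values of r, $\lambda$, and x (hypothetical, chosen for illustration), with a simple midpoint rule standing in for the integration by parts.

```python
import math

def gamma_cdf_integer_r(x, r, lam):
    """Eq. (33): 1 - sum_{j<r} e^{-lam x}(lam x)^j / j!, for r a positive integer."""
    if x <= 0:
        return 0.0
    return 1 - sum(math.exp(-lam * x) * (lam * x)**j / math.factorial(j)
                   for j in range(r))

def gamma_pdf(x, r, lam):
    """Gamma density of Eq. (30) for x > 0."""
    return lam / math.gamma(r) * (lam * x)**(r - 1) * math.exp(-lam * x)

def midpoint_integral(f, lo, hi, steps=20000):
    """Plain midpoint rule; adequate for a smooth density on a short interval."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

r, lam, x = 3, 2.0, 1.5              # hypothetical values
closed_form = gamma_cdf_integer_r(x, r, lam)
numeric = midpoint_integral(lambda u: gamma_pdf(u, r, lam), 0.0, x)
print(closed_form, numeric)
```

The sum in Eq. (33) is the probability that a Poisson random variable with parameter $\lambda x$ is less than r, which matches the waiting-time interpretation given above.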


Theorem 18 If the random variable X has an exponential distribution
with parameter $\lambda$, then

$$P[X > a + b \mid X > a] = P[X > b], \qquad \text{for } a > 0 \text{ and } b > 0.$$

PROOF

$$P[X > a + b \mid X > a] = \frac{P[X > a + b]}{P[X > a]} = \frac{e^{-\lambda(a + b)}}{e^{-\lambda a}} = e^{-\lambda b} = P[X > b].$$

////

Let X represent the lifetime of a given component; then, in words,
Theorem 18 states that the conditional probability that the component will last
a + b time units given that it has lasted a time units is the same as its initial
probability of lasting b time units. Another way of saying this is to say that an
"old" functioning component has the same lifetime distribution as a "new"
functioning component or that the component is not subject to fatigue or to
wear.
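The "no wear" interpretation can be illustrated by simulation. The sketch below draws exponential lifetimes with the standard library generator and estimates the conditional survival probability; the parameter values, sample size, and seed are all hypothetical choices.

```python
import math
import random

def conditional_survival(lam, a, b, n_samples=200_000, seed=0):
    """Estimate P[X > a + b | X > a] from simulated exponential lifetimes."""
    rng = random.Random(seed)
    lifetimes = [rng.expovariate(lam) for _ in range(n_samples)]
    survivors = [x for x in lifetimes if x > a]     # components still working at a
    return sum(x > a + b for x in survivors) / len(survivors)

lam, a, b = 0.25, 3.0, 2.0           # hypothetical rate and time spans
estimate = conditional_survival(lam, a, b)
print(estimate, math.exp(-lam * b))  # estimate against P[X > b]
```

Being a simulation, the estimate agrees with $e^{-\lambda b}$ only up to sampling error.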
3.4 Beta Distribution

A family of probability densities of continuous random variables taking on values
in the interval (0, 1) is the family of beta distributions.

Definition 14 Beta distribution If a random variable X has a density
given by

$$f_X(x) = f_X(x; a, b) = \frac{1}{B(a, b)}\, x^{a-1}(1 - x)^{b-1} I_{(0, 1)}(x), \tag{34}$$

where a > 0 and b > 0, then X is defined to have a beta distribution. ////

The function $B(a, b) = \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx$, called the beta function, is
mentioned briefly in Appendix A.

Remark The beta distribution reduces to the uniform distribution over
(0, 1) if a = b = 1. ////

Remark The cumulative distribution function of a beta-distributed
random variable is

$$F_X(x; a, b) = I_{(0, 1)}(x)\int_0^x \frac{1}{B(a, b)}\, u^{a-1}(1 - u)^{b-1}\,du + I_{[1, \infty)}(x); \tag{35}$$

it is often called the incomplete beta and has been extensively tabulated.

////
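The moment generating function for the beta distribution does not have a
simple form; however the moments are readily found by using their definition.

Theorem 19 If X is a beta-distributed random variable, then

$$\mathscr{E}[X] = \frac{a}{a + b} \qquad \text{and} \qquad \text{var}[X] = \frac{ab}{(a + b + 1)(a + b)^2}.$$

PROOF

$$\begin{aligned}
\mathscr{E}[X^k] &= \frac{1}{B(a, b)}\int_0^1 x^{k+a-1}(1 - x)^{b-1}\,dx \\
&= \frac{B(k + a, b)}{B(a, b)} = \frac{\Gamma(k + a)\Gamma(b)}{\Gamma(k + a + b)} \cdot \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \\
&= \frac{\Gamma(k + a)\Gamma(a + b)}{\Gamma(a)\Gamma(k + a + b)};
\end{aligned}$$

hence,

$$\mathscr{E}[X] = \frac{\Gamma(a + 1)\Gamma(a + b)}{\Gamma(a)\Gamma(a + b + 1)} = \frac{a}{a + b},$$

and

$$\begin{aligned}
\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 &= \frac{\Gamma(a + 2)\Gamma(a + b)}{\Gamma(a)\Gamma(a + b + 2)} - \left(\frac{a}{a + b}\right)^2 \\
&= \frac{(a + 1)a}{(a + b + 1)(a + b)} - \left(\frac{a}{a + b}\right)^2 = \frac{ab}{(a + b + 1)(a + b)^2}.
\end{aligned}$$

////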


The family of beta densities is a two-parameter family of densities that
is positive on the interval (0, 1) and can assume quite a variety of different
shapes, and, consequently, the beta distribution can be used to model an
experiment for which one of the shapes is appropriate.
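The moment formula used in the proof of Theorem 19 is simple to evaluate with the gamma function. The following sketch, with hypothetical parameters a = 2 and b = 5, computes the first two moments from $\mathscr{E}[X^k] = B(k + a, b)/B(a, b)$ and compares them with the stated mean and variance.

```python
import math

def beta_function(a, b):
    """B(a, b) written in terms of gamma functions, as in the proof of Theorem 19."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_moment(k, a, b):
    """k-th moment of a beta-distributed random variable: B(k + a, b) / B(a, b)."""
    return beta_function(k + a, b) / beta_function(a, b)

a, b = 2.0, 5.0                      # hypothetical parameters
mean = beta_moment(1, a, b)
var = beta_moment(2, a, b) - mean**2
print(mean, a / (a + b))
print(var, a * b / ((a + b + 1) * (a + b)**2))
```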


3.5 Other Continuous Distributions 

In this subsection other parametric families of probability density functions that
will appear later in this book are briefly introduced; many other families exist.
The introductions of the three families of distributions that go by the names of
Student's t distribution, chi-square distribution, and F distribution are
deferred until Chap. VI. These three families, as we shall see, are very important
when sampling from normal distributions.

Cauchy distribution A distribution which we shall find useful for illustrative 
purposes is the Cauchy, which has the density 

$$f_X(x; \alpha, \beta) = \frac{1}{\pi\beta\{1 + [(x-\alpha)/\beta]^2\}}, \tag{36}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. 

Although the Cauchy density is symmetrical about the parameter $\alpha$, its 
mean and higher moments do not exist. The cumulative distribution function 
is 

$$F_X(x) = \int_{-\infty}^{x} \frac{du}{\pi\beta\{1 + [(u-\alpha)/\beta]^2\}} = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{x-\alpha}{\beta}. \tag{37}$$


Lognormal distribution Let X be a positive random variable, and let a new 
random variable Y be defined as $Y = \log_e X$. If Y has a normal distribution, 
then X is said to have a lognormal distribution. The density of a lognormal 
distribution is given by 

$$f_X(x; \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(\log_e x - \mu)^2\right] I_{(0,\infty)}(x), \tag{38}$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. 

$$\mathscr{E}[X] = e^{\mu + \sigma^2/2} \quad\text{and}\quad \operatorname{var}[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} \tag{39}$$

for a lognormal random variable X. Also, if X has a lognormal distribution, 
then $\mathscr{E}[\log_e X] = \mu$, and $\operatorname{var}[\log_e X] = \sigma^2$. 


Double exponential or Laplace distribution A random variable X is said to 
have a double exponential, or Laplace, distribution if the density function of X 
is given by 

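$$f_X(x) = f_X(x; \alpha, \beta) = \frac{1}{2\beta} \exp\left(-\frac{|x-\alpha|}{\beta}\right), \tag{40}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. If X has a Laplace distribution, then 

$$\mathscr{E}[X] = \alpha \quad\text{and}\quad \operatorname{var}[X] = 2\beta^2. \tag{41}$$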

Weibull distribution The density 

$$f(x; a, b) = a b x^{b-1} e^{-a x^b} I_{(0,\infty)}(x), \tag{42}$$

where a > 0 and b > 0, is called the Weibull density, a distribution that has been 
successfully used in reliability theory. For b = 1, the Weibull density reduces 
to the exponential density. It has mean $(1/a)^{1/b}\,\Gamma(1 + b^{-1})$ and variance 

$$(1/a)^{2/b}\,[\Gamma(1 + 2b^{-1}) - \Gamma^2(1 + b^{-1})].$$
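
As a quick sanity check on the Weibull mean and variance just quoted, the sketch below evaluates them at b = 1, where the density is exponential with mean 1/a and variance 1/a². The function names and the value a = 0.5 are illustrative assumptions.

```python
import math

def weibull_mean(a, b):
    """Mean (1/a)^(1/b) * Gamma(1 + 1/b) of the Weibull density."""
    return (1.0 / a) ** (1.0 / b) * math.gamma(1.0 + 1.0 / b)

def weibull_variance(a, b):
    """Variance (1/a)^(2/b) * [Gamma(1 + 2/b) - Gamma(1 + 1/b)^2]."""
    g1 = math.gamma(1.0 + 1.0 / b)
    g2 = math.gamma(1.0 + 2.0 / b)
    return (1.0 / a) ** (2.0 / b) * (g2 - g1 ** 2)

a = 0.5  # illustrative value (an assumption)
# With b = 1 these should match the exponential mean 1/a and variance 1/a^2.
print(weibull_mean(a, 1.0), 1.0 / a)
print(weibull_variance(a, 1.0), 1.0 / a ** 2)
```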


Logistic distribution The logistic distribution is given in cumulative distribu¬ 
tion form by 

$$F(x; \alpha, \beta) = \frac{1}{1 + e^{-(x-\alpha)/\beta}}, \tag{43}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. The mean of the logistic distribution is given 
by $\alpha$. The variance is given by $\beta^2\pi^2/3$. Note that $F(\alpha - d; \alpha, \beta) = 
1 - F(\alpha + d; \alpha, \beta)$, and so the density of the logistic is symmetrical about $\alpha$. 
This distribution has been used to model tolerance levels in bioassay problems. 


Pareto distribution The Pareto distribution is given in density-function form 
by 


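$$f_X(x; x_0, \theta) = \frac{\theta}{x_0}\left(\frac{x_0}{x}\right)^{\theta+1} I_{(x_0,\infty)}(x), \tag{44}$$

where $\theta > 0$ and $x_0 > 0$. The mean and variance respectively of the Pareto 
distribution are given by 

$$\frac{\theta x_0}{\theta - 1} \quad\text{for } \theta > 1 \qquad\text{and}\qquad \frac{\theta x_0^2}{\theta - 2} - \left(\frac{\theta x_0}{\theta - 1}\right)^2 \quad\text{for } \theta > 2.$$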


This distribution has found application in modeling problems involving distribu¬ 
tions of incomes when incomes exceed a certain limit x 0 . 


Gumbel distribution The cumulative distribution function 

$$F(x; \alpha, \beta) = \exp(-e^{-(x-\alpha)/\beta}), \tag{45}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$, is called the Gumbel distribution. It appears 
as a limiting distribution in the theory of extreme-value statistics. 

Pearsonian system of distributions Consider a density function $f_X(x)$ 
which satisfies the differential equation 

$$\frac{1}{f_X(x)}\,\frac{d f_X(x)}{dx} = \frac{x + a}{b_0 + b_1 x + b_2 x^2}$$

for constants $a$, $b_0$, $b_1$, and $b_2$. Such a density is said to belong to the Pearsonian 
system of density functions. Many of the probability density functions that we 
have considered are special cases of the Pearsonian system. For example, if 

$$f_X(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{\Gamma(r)}\, I_{(0,\infty)}(x),$$

then 

$$\frac{1}{f_X(x)}\,\frac{d f_X(x)}{dx} = -\lambda + \frac{r-1}{x} = \frac{x - (r-1)/\lambda}{-x/\lambda}$$

for x > 0; so the gamma distribution is a member of the Pearsonian system with 
$a = -(r-1)/\lambda$, $b_1 = -1/\lambda$, and $b_0 = b_2 = 0$. 


4 COMMENTS 

We conclude this chapter by making several comments that tie together some 
of the density functions defined in Secs. 2 and 3 of this chapter. 

4.1 Approximations 

Although many approximations of one distribution by another exist, we will 
give only three here. Others will be given along with the central-limit theorem 
in Chaps. V and VI. 

Binomial by Poisson We defined the binomial discrete density function, 
with parameters n and p, as 

$$\binom{n}{x} p^x (1-p)^{n-x} \quad\text{for } x = 0, 1, \ldots, n.$$

If the parameter n approaches infinity and p approaches 0 in such a way that 
np remains constant, say equal to $\lambda$, then 

$$\binom{n}{x} p^x (1-p)^{n-x} \to \frac{e^{-\lambda}\lambda^x}{x!} \tag{47}$$



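for fixed integer x. The above follows immediately from the following 
consideration: 

$$\binom{n}{x} p^x (1-p)^{n-x} = \frac{(n)_x}{x!}\left(\frac{\lambda}{n}\right)^x\left(1 - \frac{\lambda}{n}\right)^{n-x} = \frac{\lambda^x}{x!}\,\frac{(n)_x}{n^x}\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-x} \to \frac{e^{-\lambda}\lambda^x}{x!},$$

since 

$$\left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}, \qquad \left(1 - \frac{\lambda}{n}\right)^{-x} \to 1, \qquad\text{and}\qquad \frac{(n)_x}{n^x} \to 1 \quad\text{as } n \to \infty,$$

where $(n)_x = n(n-1)\cdots(n-x+1)$. 

Thus, for large n and small p the binomial probability $\binom{n}{x} p^x (1-p)^{n-x}$ 
can be approximated by the Poisson probability $e^{-np}(np)^x/x!$. The utility of 
this approximation is evident if one notes that the binomial probability 
involves two parameters and the Poisson only one. 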
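
The convergence of binomial probabilities to Poisson probabilities can be observed numerically. The sketch below holds np fixed at an illustrative value of 2 and reports the largest discrepancy over x = 0, ..., 10 as n grows; the function names and the chosen values of n are assumptions made for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """Binomial probability C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability e^(-lam) lam^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def max_gap(n, lam, x_max=10):
    """Largest absolute difference between the two for x = 0..x_max, with p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
               for x in range(x_max + 1))

lam = 2.0  # np is held fixed at this illustrative value
for n in (10, 100, 1000, 10000):
    print(n, max_gap(n, lam))
```

The printed discrepancy should shrink as n increases, in line with the limit stated above.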


Binomial and Poisson by normal 

Theorem 20 Let random variable X have a Poisson distribution with 
parameter $\lambda$; then for fixed a < b 

$$P\left[a \le \frac{X - \lambda}{\sqrt{\lambda}} \le b\right] = P[\lambda + a\sqrt{\lambda} \le X \le \lambda + b\sqrt{\lambda}] \to \Phi(b) - \Phi(a) \quad\text{as } \lambda \to \infty. \tag{48}$$

PROOF Omitted. [Eq. (48) can be proved using Stirling's formula, 
which is given in Appendix A. It also follows from the central-limit 
theorem.] //// 


Theorem 21 De Moivre-Laplace limit theorem Let a random variable 
X have a binomial distribution with parameters n and p; then for fixed 
a < b 

$$P\left[a \le \frac{X - np}{\sqrt{npq}} \le b\right] = P[np + a\sqrt{npq} \le X \le np + b\sqrt{npq}] \to \Phi(b) - \Phi(a) \quad\text{as } n \to \infty. \tag{49}$$

PROOF Omitted. (This is a special case of the central-limit 
theorem, given in Chaps. V and VI.) //// 


Remark We approximated the binomial distribution with a Poisson 
distribution in Eq. (47) for large n and small p. Theorem 21 gives a 
normal approximation of the binomial distribution for large n. //// 

The usefulness of Theorems 20 and 21 rests in the approximations that 
they give. For instance, Eq. (49) states that $P[np + a\sqrt{npq} \le X \le np + b\sqrt{npq}]$ 
is approximately equal to $\Phi(b) - \Phi(a)$ for large n. Or if $c = np + a\sqrt{npq}$ and 
$d = np + b\sqrt{npq}$, then Eq. (49) gives that $P[c \le X \le d]$ is approximately equal 
to 

$$\Phi\left(\frac{d - np}{\sqrt{npq}}\right) - \Phi\left(\frac{c - np}{\sqrt{npq}}\right)$$

for large n, and, so, an approximate value for the probability that a binomial 
random variable falls in an interval can be obtained from the standard normal 
distribution. Note that the binomial distribution is discrete and the 
approximating normal distribution is continuous. 


EXAMPLE 15 Suppose that two fair dice are tossed 600 times. Let X 
denote the number of times a total of 7 occurs. Then X has a binomial 
distribution with parameters n = 600 and p = 1/6. $\mathscr{E}[X] = 100$. Find 
$P[90 \le X \le 110]$. 

$$P[90 \le X \le 110] = \sum_{j=90}^{110} \binom{600}{j} \left(\frac{1}{6}\right)^{j} \left(\frac{5}{6}\right)^{600-j},$$

a sum that is tedious to evaluate. Using the approximation given by 
Eq. (49), we have 

$$P[90 \le X \le 110] \approx \Phi\left(\frac{110 - 100}{\sqrt{500/6}}\right) - \Phi\left(\frac{90 - 100}{\sqrt{500/6}}\right) = \Phi(\sqrt{6/5}) - \Phi(-\sqrt{6/5}) \approx \Phi(1.095) - \Phi(-1.095) \approx .726.$$

//// 
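
With a computer the binomial sum in Example 15 is no longer tedious, so the normal approximation can be compared against it directly. The sketch below is one way to do so, assuming the standard identity Φ(z) = ½[1 + erf(z/√2)]; the variable names are illustrative. The two printed numbers need not agree closely, since the approximation in the example applies no continuity correction.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 600, 1.0 / 6.0
q = 1.0 - p
mu = n * p                   # mean np
sd = math.sqrt(n * p * q)    # standard deviation sqrt(npq)

# Exact binomial probability P[90 <= X <= 110].
exact = sum(math.comb(n, j) * p ** j * q ** (n - j) for j in range(90, 111))

# Normal approximation of Eq. (49), without continuity correction.
approx = std_normal_cdf((110 - mu) / sd) - std_normal_cdf((90 - mu) / sd)

print(exact, approx)
```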


4.2 Poisson and Exponential Relationship 

When the Poisson distribution was introduced in Subsec. 2.4, an experiment 
consisting of the counting of the number of happenings of a certain phenomenon 
in time was given special consideration. We argued that under certain conditions 
the count of the number of happenings in a fixed time interval was Poisson dis¬ 
tributed with parameter, the mean, proportional to the length of the interval. 
Suppose now that one of these happenings has just occurred; what then is the 
distribution of the length of time, say X, that one will have to wait until the 
next happening? $P[X > t] = P[\text{no happenings in time interval of length } t] = 
e^{-\nu t}$, where $\nu$ is the mean occurrence rate; so 

$$F_X(t) = P[X \le t] = 1 - P[X > t] = 1 - e^{-\nu t} \quad\text{for } t > 0;$$

that is, X has an exponential distribution. On the other hand, it can be proved, 
under an independence assumption, that if the happenings are occurring in 
time in such a way that the distribution of the lengths of time between successive 
happenings is exponential, then the distribution of the number of happenings 
in a fixed time interval is Poisson distributed. Thus the exponential and Poisson 
distributions are related. 
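
The converse direction can be illustrated by simulation. The sketch below generates independent exponential waiting times, counts the happenings falling in each unit-length interval, and compares the sample mean and variance of the counts, which should both be near the rate if the counts are Poisson. The rate of 3, the number of intervals, and the seed are illustrative assumptions; the simulation suggests the result and does not prove it.

```python
import random

def simulate_counts(rate, num_intervals, seed=0):
    """Count happenings in each unit interval [k, k+1) when the waiting times
    between successive happenings are independent exponentials."""
    rng = random.Random(seed)
    counts = [0] * num_intervals
    t = rng.expovariate(rate)           # time of the first happening
    while t < num_intervals:
        counts[int(t)] += 1             # int(t) is the index of the interval containing t
        t += rng.expovariate(rate)      # add the next waiting time
    return counts

rate = 3.0  # illustrative mean occurrence rate
counts = simulate_counts(rate, 20000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)

# For a Poisson count, the mean and variance both equal the rate.
print(mean, var)
```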


4.3 Contagious Distributions and Truncated Distributions 

A brief introduction to the concept of contagious distributions is given here. 
If $f_0(\cdot), f_1(\cdot), \ldots, f_n(\cdot), \ldots$ is a sequence of density functions which are either all 
discrete density functions or all probability density functions which may or may 
not depend on parameters, and $p_0, p_1, \ldots, p_n, \ldots$ is a sequence of parameters 
satisfying $p_i \ge 0$ and $\sum_{i=0}^{\infty} p_i = 1$, then $\sum_{i=0}^{\infty} p_i f_i(x)$ is a density function, which is 
sometimes called a contagious distribution or a mixture. For example, if 
$f_0(x) = \phi_{\mu_0, \sigma_0^2}(x)$ (a normal with mean $\mu_0$ and variance $\sigma_0^2$) and $f_1(x) = \phi_{\mu_1, \sigma_1^2}(x)$, 
then 

$$p_0\,\phi_{\mu_0, \sigma_0^2}(x) + p_1\,\phi_{\mu_1, \sigma_1^2}(x) = (1-p)\,\frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-\frac{1}{2}[(x-\mu_0)/\sigma_0]^2} + p\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{1}{2}[(x-\mu_1)/\sigma_1]^2}, \tag{50}$$

where $p_1 = p$ and $p_0 = 1 - p$, is a mixture of two normal densities. Equation 
(50) is also sometimes referred to as a contaminated normal. A random variable 
Z has distribution given by Eq. (50) if it is normally distributed with mean $\mu_1$ and 
variance $\sigma_1^2$ with probability p and normally distributed with mean $\mu_0$ and 
variance $\sigma_0^2$ with probability 1 - p. Contagious distributions or mixtures can be 
useful models for certain experiments. For instance, the mixture of two normal 
distributions given in Eq. (50) has five parameters, namely, $p$, $\mu_0$, $\mu_1$, $\sigma_0$, and 
$\sigma_1$. If we vary these five parameters, the density can be forced to assume a 
variety of different shapes, some of which are bimodal; that is, the density has 
two distinct local maximums. 
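
The dependence of the shape on the five parameters can be explored numerically. The following is a rough sketch that evaluates the mixture of Eq. (50) on a grid and counts strict local maxima; the grid limits, step count, and the two parameter settings are illustrative assumptions, and a grid search of this kind is only a heuristic for locating modes.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_pdf(x, p, mu0, sigma0, mu1, sigma1):
    """Eq. (50): weight 1 - p on the first normal and p on the second."""
    return (1.0 - p) * normal_pdf(x, mu0, sigma0) + p * normal_pdf(x, mu1, sigma1)

def count_local_maxima(p, mu0, sigma0, mu1, sigma1, lo=-10.0, hi=10.0, steps=4000):
    """Count grid points whose density strictly exceeds both neighbors."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, p, mu0, sigma0, mu1, sigma1) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Well-separated component means versus heavily overlapping components.
print(count_local_maxima(0.5, -2.0, 1.0, 2.0, 1.0))
print(count_local_maxima(0.5, -0.5, 1.0, 0.5, 1.0))
```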

Physical considerations of the random experiment at hand can sometimes 
persuade one to consider modeling the experiment with a mixture. The 
experimenter may know that the phenomena that he is observing are a mixture; 
for example, the radioactive particle emissions under observation might be a 
mixture of the particle emissions of two, or several, different types of radioactive 
materials. 
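
The concept of mixing can be extended. Let $\{f(x; \theta)\}$ be a family of 
density functions parameterized or indexed by $\theta$. Let the totality of values 
that the parameter $\theta$ can assume be denoted by $\Theta$. If $\Theta$ is an interval (possibly 
infinite) and $g(\theta)$ is a probability density function which is 0 for all arguments 
not in $\Theta$, then 

$$\int_{\Theta} f(x; \theta)\, g(\theta)\, d\theta \tag{51}$$

is again a density function, called a contagious distribution or a mixture. For 
example, suppose $f(x; \theta) = e^{-\theta}\theta^x/x!$ for x = 0, 1, 2, ... and $f(x; \theta) = 0$ 
otherwise and 

$$g(\theta) = \frac{\lambda^r}{\Gamma(r)}\, \theta^{r-1} e^{-\lambda\theta}\, I_{(0,\infty)}(\theta),$$

a gamma density. Then 

$$\int_{\Theta} f(x; \theta)\, g(\theta)\, d\theta = \int_0^{\infty} \frac{e^{-\theta}\theta^x}{x!}\, \frac{\lambda^r}{\Gamma(r)}\, \theta^{r-1} e^{-\lambda\theta}\, d\theta$$

$$= \frac{\lambda^r}{x!\,\Gamma(r)}\, \frac{\Gamma(r+x)}{(\lambda+1)^{r+x}} \int_0^{\infty} \frac{[(\lambda+1)\theta]^{r+x-1} e^{-(\lambda+1)\theta}}{\Gamma(r+x)}\, d[(\lambda+1)\theta]$$

$$= \left(\frac{\lambda}{\lambda+1}\right)^{r} \frac{\Gamma(r+x)}{(x!)\,\Gamma(r)}\, \frac{1}{(\lambda+1)^{x}} = \binom{r+x-1}{x} \left(\frac{\lambda}{\lambda+1}\right)^{r} \left(\frac{1}{\lambda+1}\right)^{x} \quad\text{for } x = 0, 1, \ldots,$$

which is the density function of a negative binomial distribution with 
parameters r and $p = \lambda/(\lambda+1)$. We say that the derived negative binomial 
distribution is the gamma mixture of Poissons. 

The mixture 

$$\int_0^{\infty} \frac{e^{-\theta}\theta^x}{x!}\, g(\theta)\, d\theta$$

is sometimes called a compound Poisson, where $g(\theta) I_{(0,\infty)}(\theta)$ is a probability 
density function. 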
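
The gamma mixture of Poissons can be checked against the negative binomial density by numerical integration. The sketch below uses a simple midpoint rule on a truncated range; the values r = 3 and λ = 2, the upper limit of 60, and the step count are illustrative assumptions, and the midpoint rule is only one of many quadrature choices.

```python
import math

def poisson_pmf(x, theta):
    """f(x; theta) = e^(-theta) theta^x / x!."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

def gamma_pdf(theta, r, lam):
    """g(theta) = lam^r theta^(r-1) e^(-lam theta) / Gamma(r) for theta > 0."""
    return lam ** r * theta ** (r - 1) * math.exp(-lam * theta) / math.gamma(r)

def mixture_pmf(x, r, lam, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral of f(x; theta) g(theta) over (0, upper)."""
    h = upper / steps
    return h * sum(
        poisson_pmf(x, (i + 0.5) * h) * gamma_pdf((i + 0.5) * h, r, lam)
        for i in range(steps)
    )

def negative_binomial_pmf(x, r, p):
    """Gamma(r + x) / (x! Gamma(r)) * p^r * (1 - p)^x."""
    coeff = math.gamma(r + x) / (math.factorial(x) * math.gamma(r))
    return coeff * p ** r * (1.0 - p) ** x

r, lam = 3.0, 2.0          # illustrative parameter values
p = lam / (lam + 1.0)      # p = lam / (lam + 1), as derived in the text
for x in range(5):
    print(x, mixture_pmf(x, r, lam), negative_binomial_pmf(x, r, p))
```

The two columns should agree to within the quadrature error.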

We have sketchily illustrated above how new parametric families of den¬ 
sities can be obtained from existing families by the technique of mixing. In 
Subsec. 2.6 we indicated how truncation could be employed to generate new 
families of discrete densities. Truncation can also be utilized to form other 
families of continuous distributions. For instance, the family of beta 
distributions provides densities that are useful in modeling an experiment for which 
it is known that the values that the random variable can assume are between 0 
and 1. A truncated normal or gamma distribution would also provide a useful 
model for such an experiment. A normal distribution that is truncated at 0 
on the left and at 1 on the right is defined in density form as 

$$f(x) = f(x; \mu, \sigma) = \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2\right] I_{(0,1)}(x)}{\Phi\left(\dfrac{1-\mu}{\sigma}\right) - \Phi\left(\dfrac{-\mu}{\sigma}\right)}. \tag{52}$$


This truncated normal distribution, like the beta distribution, assumes values 
between 0 and 1. 

Truncation can be defined in general. If X is a random variable with 
density $f_X(\cdot)$ and cumulative distribution $F_X(\cdot)$, then the density of X truncated 
on the left at a and on the right at b is given by 

$$\frac{f_X(x)\, I_{(a,b)}(x)}{F_X(b) - F_X(a)}.$$

PROBLEMS 

1 (a) Let X be a random variable having a binomial distribution with parameters 
n = 25 and p = .2. Evaluate $P[X < \mu_X - 2\sigma_X]$. 

(b) If X is a random variable with Poisson distribution satisfying P[X = 0] = 
P[X = 1], what is $\mathscr{E}[X]$? 

(c) If X is uniformly distributed over (1, 2), find z such that P[X > z + p x ] = 1. 

(d) If X is normally distributed with mean 2 and variance 1, find $P[|X - 2| < 1]$. 

(e) Suppose X is binomially distributed with parameters n and p; further 
suppose that $\mathscr{E}[X] = 5$ and var [X] = 4. Find n and p. 

(f) If $\mathscr{E}[X] = 10$ and $\sigma_X = 3$, can X have a negative binomial distribution? 

(g) If X has a negative exponential distribution with mean 2, find $P[X < 1 \mid X < 2]$. 

( h ) Name three distributions for which P[X < p x ] — h 

(i) Let X be a random variable having binomial distribution with parameters 
n = 100 and p = .1. Evaluate $P[X < \mu_X - 2\sigma_X]$. 

(j) If X has a Poisson distribution and P[X =■- 0] = i, what is <?[X]? 

( k ) Suppose X has a binomial distribution with parameters n and p. For what 
p is var [X] maximized if we assumed n is fixed ? 

(l) Suppose X has a negative exponential distribution with parameter $\lambda$. If 
P[X < 1] = P[X > 1], what is var [X]? 

(m) Suppose A is a continuous random variable with uniform distribution 
having mean 1 and variance f. What is P[X < 0] ? 

(n) If X has a beta distribution, can $\mathscr{E}[1/X]$ be unity? 

(o) Can X ever have the same distribution as -X? If so, when? 

(p) If X is a random variable having moment generating function $\exp(e^t - 1)$, 
what is $\mathscr{E}[X]$? 

2 (a) Find the mode of the beta distribution. 

(b) Find the mode of the gamma distribution. 

3 Name a parametric family of distributions which satisfies: 

(a) The mean must be greater than or equal to the variance. 

( b ) The mean must be equal to the variance. 

(c) The mean must be less than or equal to the variance. 

(d) The mean can be less than, equal to, or greater than the variance (for dif¬ 
ferent parameter values). 

4 (a) If X is normally distributed with mean 2 and variance 2, express 
$P[|X - 1| < 2]$ in terms of the standard normal cumulative distribution 
function. 

(b) If X is normally distributed with mean $\mu > 0$ and variance $\sigma^2 = \mu^2$, express 
$P[X < -\mu \mid X < \mu]$ in terms of the standard normal cumulative distribution 
function. 

(c) Let X be normally distributed with mean $\mu$ and variance $\sigma^2$. Suppose $\sigma^2$ is 
some function of $\mu$, say $\sigma^2 = h(\mu)$. Pick $h(\cdot)$ so that P[X < 0] does not 
depend on $\mu$ for $\mu > 0$. 

5 Use the alternate definition of the median as given in the remark following Defi¬ 
nition 18 of Chap. II. Find the median in each of the following cases: 

(a) $f_X(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$. 

(b) X is uniformly distributed on the interval $(\theta_1, \theta_2)$. 

(c) X has a binomial distribution with n = 4, p = .5. 

(d) X has a binomial distribution with n = 5, p = .5. 

( e) X has a binomial distribution with n = 2, p = .9. 

*6 A contractor has found through experience that the low bid for a job (excluding 
his own bid) is a random variable that is uniformly distributed over the interval 
(f C, 2C), where C is the contractor’s cost estimate (no profit or loss) of the job. 
If profit is defined as 0 if the contractor does not get the job (his bid is greater than 
the low bid) and as the difference between his bid and his cost estimate C if he gets 
the job, what should he bid (in terms of C) in order to maximize his expected 
profit ? 

7 A merchant has found that the number of items of brand XYZ that he can sell 
in a day is a Poisson random variable with mean 4. 

(a) How many items of brand XYZ should the merchant stock to be 95 percent 
certain that he will have enough to last for 25 days? (Give a numerical 
answer.) 

(b) What is the expected number of days out of 25 that the merchant will sell 
no items of brand XYZ ? 

8 (a) If X is binomially distributed with parameters n and p, what is the distribution 
of Y = n - X? 

(b) Two dice are thrown n times. Let X denote the number of throws in which the 
number on the first die exceeds the number on the second die. What is the 
distribution of X? 

*(c) A drunk performs a “random walk” over positions 0, ±1, ±2, ... as follows: 
He starts at 0. He takes successive one-unit steps, going to the right with 
probability p and to the left with probability 1 - p. His steps are 
independent. Let X denote his position after n steps. Find the distribution of 
(X + n)/2, and then find $\mathscr{E}[X]$. 

*(d) Let $X_1$ ($X_2$) have a binomial distribution with parameters n and $p_1$ (n and $p_2$). 
If $p_1 < p_2$, show that $P[X_1 \le k] \ge P[X_2 \le k]$ for k = 0, 1, ..., n. (This 
result says that the smaller the p, the more the binomial distribution is shifted 
to the left.) 

9 In a town with 5000 adults, a sample of 100 is asked their opinion of a proposed 
municipal project; 60 are found to favor it, and 40 oppose it. If, in fact, the 
adults of the town were equally divided on the proposal, what would be the prob¬ 
ability of obtaining a majority of 60 or more favoring it in a sample of 100? 

10 A distributor of bean seeds determines from extensive tests that 5 percent of a large 
batch of seeds will not germinate. He sells the seeds in packages of 200 and 
guarantees 90 percent germination. What is the probability that a given package 
will violate the guarantee? 

*11 ( a ) A manufacturing process is intended to produce electrical fuses with no 
more than 1 percent defective. It is checked every hour by trying 10 fuses 
selected at random from the hour’s production. If 1 or more of the 10 
fail, the process is halted and carefully examined. If, in fact, its prob¬ 
ability of producing a defective fuse is .01, what is the probability that the 
process will needlessly be examined in a given instance? 

( b ) Referring to part (a), how many fuses (instead of 10) should be tested if the 
manufacturer desires that the probability be about .95 that the process will 
be examined when it is producing 10 percent defectives? 

12 An insurance company finds that .005 percent of the population die from a certain 
kind of accident each year. What is the probability that the company must pay 
off on more than 3 of 10,000 insured risks against such accidents in a given 
year? 

13 (a) If X has a Poisson distribution with P[X = 1] = P[X = 2], what is 
P[X = 1 or 2]? 

(b) If X has a Poisson distribution with mean 1, show that $\mathscr{E}[|X - 1|] = 2\sigma_X/e$. 

*14 Recall Theorems 4 and 8. Formulate, and then prove or disprove a similar 
theorem for the negative binomial distribution. 

*15 Let X be normally distributed with mean $\mu$ and variance $\sigma^2$. Truncate the density 
of X on the left at a and on the right at b, and then calculate the mean of the 
truncated distribution. (Note that the mean of the truncated distribution should fall 
between a and b. Furthermore, if $a = \mu - c$ and $b = \mu + c$, then the mean of the 
truncated distribution should equal $\mu$.) 

*16 Show that the hypergeometric distribution can be approximated by the binomial 
distribution for large M and K; i.e., show that 

17 Let X be the life in hours of a radio tube. Assume that X is normally distributed 
with mean 200 and variance $\sigma^2$. If a purchaser of such radio tubes requires that 
at least 90 percent of the tubes have lives exceeding 150 hours, what is the largest 
value $\sigma$ can be and still have the purchaser satisfied? 

18 Assume that the number of fatal car accidents in a certain state obeys a Poisson 
distribution with an average of one per day. 

(a) What is the probability of more than ten such accidents in a week? 

( b ) What is the probability that more than 2 days will lapse between two such 
accidents ? 

19 The distribution given by 

$$f(x; \beta) = \frac{x}{\beta^2}\, e^{-\frac{1}{2}(x/\beta)^2}\, I_{(0,\infty)}(x) \quad\text{for } \beta > 0$$

is called the Rayleigh distribution. 

(a) Show that the mean and variance exist, and find them. 

(b) Does the Rayleigh distribution belong to the Pearsonian system? 

20 The distribution given by 

$$f(x; \beta) = \frac{4}{\beta^3\sqrt{\pi}}\, x^2 e^{-x^2/\beta^2}\, I_{(0,\infty)}(x) \quad\text{for } \beta > 0$$

is called the Maxwell distribution. 

(a) Show that the mean and variance exist, and find them. 

(b) Does this distribution belong to the Pearsonian system? 

21 The distribution given by 

$$f(x; n) = \frac{\Gamma[(n-1)/2]}{\Gamma(1/2)\,\Gamma[(n-2)/2]}\, (1 - x^2)^{(n-4)/2}\, I_{(-1,1)}(x)$$

is called the r distribution. 

(a) Show that the mean and variance exist, and find them. 

(b) Does this distribution belong to the Pearsonian system? 

22 A die is cast until a 6 appears. What is the probability that it must be cast more 
than five times? 

23 Red-blood-cell deficiency may be determined by examining a specimen of the 
blood under a microscope. Suppose that a certain small fixed volume contains, 
on an average, 20 red cells for normal persons. What is the probability that a 
specimen from a normal person will contain less than 15 red cells? 

24 A telephone switchboard handles 600 calls, on an average, during a rush hour. 
The board can make a maximum of 20 connections per minute. Use the Poisson 
distribution to evaluate the probability that the board will be overtaxed during any 
given minute. 

25 Suppose that a particle is equally likely to release one, two, or three other particles, 
and suppose that these second-generation particles are in turn each equally likely 
to release one, two, or three third-generation particles. What is the density of 
the number of third-generation particles? 

26 Find the mean of the Gumbel distribution. 

27 Derive the mean and variance of the Weibull distribution. 
*28 Show that 

$$P[X \ge k] = \frac{1}{B(k, n-k+1)} \int_0^p u^{k-1}(1-u)^{n-k}\, du$$

for X a binomially distributed random variable. That is, if X is binomially 
distributed with parameters n and p and Y is beta-distributed with parameters k and 
n - k + 1, then $F_Y(p) = 1 - F_X(k-1)$. 

*29 Suppose that X has a binomial distribution with parameters n and p and Y has a 
negative binomial distribution with parameters r and p. Show that $F_X(r-1) = 
1 - F_Y(n-r)$. 

*30 If U is a random variable that is uniformly distributed over the interval [0, 1], then 
the random variable $Z_\lambda = [U^\lambda - (1-U)^\lambda]/\lambda$ is said to have Tukey's symmetrical 
lambda distribution. Find the first four moments of $Z_\lambda$. Find two different $\lambda$'s, 
say $\lambda_1$ and $\lambda_2$, such that $Z_{\lambda_1}$ and $Z_{\lambda_2}$ have the same first four moments and unit 
standard deviations. 



IV 

JOINT AND CONDITIONAL DISTRIBUTIONS, 
STOCHASTIC INDEPENDENCE, 
MORE EXPECTATION 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concepts of k-dimensional 
distribution functions, conditional distributions, joint and conditional 
expectation, and independence of random variables. It, like Chap. II, is primarily a 
“definitions-and-their-understanding” chapter. 

The chapter is divided into four main sections in addition to the present 
one. In Sec. 2, joint distributions, both in cumulative and density-function 
form, are introduced. The important k-dimensional discrete distribution, 
called the multinomial, is included as an example. Conditional distributions and 
independence of random variables are the subject of Sec. 3. Section 4 deals 
with expectation with respect to k-variate distributions. Definitions of 
covariance, the correlation coefficient, and joint moment generating functions, all 
of which are special expectations, are given. The important concept of 
conditional expectation is discussed in Subsec. 4.3. Results relating independence 
and expectation are presented in Subsec. 4.5, and the famous Cauchy-Schwarz 
inequality is proved in Subsec. 4.6. The last main section, Sec. 5, is devoted 
to the important bivariate normal distribution, which gives one unified example 
of many of the terms defined in the preceding sections. 

This chapter is the multidimensional analog of Chap. II. It provides 
definitions needed to understand distributional-theory results of Chap. V. 


2 JOINT DISTRIBUTION FUNCTIONS 

In the study of many random experiments, there are, or can be, more than one 
random variable of interest; hence we are compelled to extend our definitions 
of the distribution and density function of one random variable to those of 
several random variables. Such definitions are the essence of this section, 
which is the multivariate counterpart of Secs. 2 and 3 of Chap. II. As in the 
univariate case we will first define, in Subsec. 2.1, the cumulative distribution 
function. Although it is not as convenient to work with as density functions, 
it does exist for any set of k random variables. Density functions for jointly 
discrete and jointly continuous random variables will be given in Subsecs. 2.2 
and 2.3, respectively. 

2.1 Cumulative Distribution Function 

Definition 1 Joint cumulative distribution function Let $X_1, X_2, \ldots, X_k$ 
be k random variables all defined on the same probability space 
$(\Omega, \mathscr{A}, P[\cdot])$. The joint cumulative distribution function of $X_1, \ldots, X_k$, 
denoted by $F_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot)$, is defined as $P[X_1 \le x_1; \ldots; X_k \le x_k]$ for 
all $(x_1, x_2, \ldots, x_k)$. //// 

Thus a joint cumulative distribution function is a function with domain 
euclidean k space and counterdomain the interval [0, 1]. If k = 2, the joint 
cumulative distribution function is a function of two variables, and so its 
domain is just the xy plane. 


EXAMPLE 1 Consider the experiment of tossing two tetrahedra (regular 
four-sided polyhedron) each with sides labeled 1 to 4. Let X denote the 
number on the downturned face of the first tetrahedron and Y the larger 
of the downturned numbers. The goal is to find $F_{X,Y}(\cdot, \cdot)$, the joint 
cumulative distribution function of X and Y. Observe first that the random 
variables X and Y jointly take on only the values 

(1, 1), (1,2), (1,3), (1,4), 

(2,2), (2,3), (2,4), ' 

(3, 3), (3, 4), 

(4, 4). 

(The first component is the value of X, and the second the value of Y.) 

FIGURE 1 
Sample space for experiment of tossing two tetrahedra (horizontal axis: first 
tetrahedron). 

The sample space for this experiment is displayed in Fig. 1. The 16 
sample points are assumed to be equally likely. Our objective is to find 
$F_{X,Y}(x, y)$ for each point (x, y). As an example let (x, y) = (2, 3), and 
find $F_{X,Y}(2, 3) = P[X \le 2; Y \le 3]$. Now the event $\{X \le 2 \text{ and } Y \le 3\}$ 
corresponds to the encircled sample points in Fig. 1; hence $F_{X,Y}(2, 3) = 
6/16$. Similarly, $F_{X,Y}(x, y)$ can be found for other values of x and y. 
$F_{X,Y}(x, y)$ is tabled in Fig. 2. //// 
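
The joint cumulative distribution function of Example 1 can be obtained by direct enumeration of the 16 equally likely sample points. The sketch below is one way to do this with exact fractions; the function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (first tetrahedron, second tetrahedron).
outcomes = list(product(range(1, 5), repeat=2))

def joint_cdf(x, y):
    """P[X <= x; Y <= y], where X is the first face and Y the larger of the two faces."""
    favorable = sum(1 for first, second in outcomes
                    if first <= x and max(first, second) <= y)
    return Fraction(favorable, len(outcomes))

# The value worked out in the example (Fraction prints it in lowest terms).
print(joint_cdf(2, 3))

# One row per value of y, evaluated at x = 1, 2, 3, 4.
for y in (4, 3, 2, 1):
    print(y, [str(joint_cdf(x, y)) for x in (1, 2, 3, 4)])
```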


We saw that the cumulative distribution function of a unidimensional 
random variable had certain properties; the same is true of a joint cumulative. 
We shall list these properties for the joint cumulative distribution function of 
two random variables; the generalization to k dimensions is straightforward. 


TABLE OF VALUES OF $F_{X,Y}(x, y)$ 

4 ≤ y          0       4/16        8/16        12/16       1 
3 ≤ y < 4      0       3/16        6/16        9/16        9/16 
2 ≤ y < 3      0       2/16        4/16        4/16        4/16 
1 ≤ y < 2      0       1/16        1/16        1/16        1/16 
y < 1          0       0           0           0           0 

             x < 1   1 ≤ x < 2   2 ≤ x < 3   3 ≤ x < 4   4 ≤ x 

FIGURE 2 

Properties of bivariate cumulative distribution function $F(\cdot, \cdot)$ 

(i) $F(-\infty, y) = \lim_{x \to -\infty} F(x, y) = 0$ for all y, $F(x, -\infty) = \lim_{y \to -\infty} F(x, y) = 0$ 
for all x, and $\lim_{x \to \infty,\, y \to \infty} F(x, y) = F(\infty, \infty) = 1$. 

(ii) If $x_1 < x_2$ and $y_1 < y_2$, then $P[x_1 < X \le x_2;\ y_1 < Y \le y_2]$ 
$= F(x_2, y_2) - F(x_2, y_1) - F(x_1, y_2) + F(x_1, y_1) \ge 0$. 

(iii) F(x, y) is right continuous in each argument; that is, 

$$\lim_{0 < h \to 0} F(x + h, y) = \lim_{0 < h \to 0} F(x, y + h) = F(x, y).$$

We will not prove these properties. Property (ii) is a monotonicity 
property of sorts; it is not equivalent to $F(x_1, y_1) \le F(x_2, y_2)$ for $x_1 \le x_2$ 
and $y_1 \le y_2$. Consider, for example, the bivariate function G(x, y) defined 
as in Fig. 3. Note that $G(x_1, y_1) \le G(x_2, y_2)$ for $x_1 \le x_2$ and $y_1 \le y_2$, 
yet $G(1+\varepsilon, 1+\varepsilon) - G(1+\varepsilon, 1-\varepsilon) - G(1-\varepsilon, 1+\varepsilon) + G(1-\varepsilon, 1-\varepsilon) = 1 - 
(1-\varepsilon) - (1-\varepsilon) = 2\varepsilon - 1 < 0$ for $\varepsilon < \frac{1}{2}$; so G(x, y) does not satisfy property 
(ii) and consequently is not a bivariate cumulative distribution function. 

Definition 2 Bivariate cumulative distribution function Any function 
satisfying properties (i) to (iii) is defined to be a bivariate cumulative 
distribution function without reference to any random variables. //// 

Definition 3 Marginal cumulative distribution function If $F_{X,Y}(\cdot, \cdot)$ is 
the joint cumulative distribution function of X and Y, then the cumulative 
distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$ are called marginal cumulative 
distribution functions. //// 


TABLE OF G(x, y) 

1 ≤ y          0       x           1 
0 ≤ y < 1      0       0           y 
y < 0          0       0           0 

             x < 0   0 ≤ x < 1   1 ≤ x 

FIGURE 3 
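
The failure of property (ii) for the function G of Fig. 3 can be reproduced with a few lines of code. The sketch below encodes the table, reading the left endpoint of each range as included (an assumption about the table's boundaries), and evaluates the rectangle expression of property (ii) around the point (1, 1); the names and the choice ε = 0.25 are illustrative.

```python
def G(x, y):
    """The bivariate function tabled in Fig. 3."""
    if x < 0 or y < 0:
        return 0.0
    if y >= 1:
        return x if x < 1 else 1.0
    # Remaining case: 0 <= y < 1.
    return y if x >= 1 else 0.0

def rectangle_mass(F, x1, x2, y1, y2):
    """F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1), the quantity in property (ii)."""
    return F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1)

eps = 0.25  # any value below 1/2 gives a negative result
mass = rectangle_mass(G, 1 - eps, 1 + eps, 1 - eps, 1 + eps)
print(mass, 2 * eps - 1)
```

A negative value here would be a negative "probability" for the rectangle, which is why G cannot be a bivariate cumulative distribution function.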

Remark $F_X(x) = F_{X,Y}(x, \infty)$, and $F_Y(y) = F_{X,Y}(\infty, y)$; that is, 
knowledge of the joint cumulative distribution function of X and Y implies 
knowledge of the two marginal cumulative distribution functions. //// 


The converse of the above remark is not generally true; in fact, an example 
(Example 8) will be given in Subsec. 2.3 below that gives an entire family of 
joint cumulative distribution functions, and each member of the family has the 
same marginal distributions. 

We will conclude this section with a remark that gives an inequality 
involving the joint cumulative distribution and marginal distributions. The 
proof is left as an exercise. 


Remark $F_X(x) + F_Y(y) - 1 \le F_{X,Y}(x, y) \le \sqrt{F_X(x) F_Y(y)}$ for all x, y. 
//// 


2.2 Joint Density Functions for Discrete Random Variables 

If $X_1, X_2, \ldots, X_k$ are random variables defined on the same probability space, 
then $(X_1, X_2, \ldots, X_k)$ is called a k-dimensional random variable. 

Definition 4 Joint discrete random variables The k-dimensional 
random variable $(X_1, X_2, \ldots, X_k)$ is defined to be a k-dimensional discrete 
random variable if it can assume values only at a countable number of 
points $(x_1, x_2, \ldots, x_k)$ in k-dimensional real space. We also say that 
the random variables $X_1, X_2, \ldots, X_k$ are joint discrete random variables. 
//// 

Definition 5 Joint discrete density function If $(X_1, X_2, \ldots, X_k)$ is 
a k-dimensional discrete random variable, then the joint discrete density 
function of $(X_1, X_2, \ldots, X_k)$, denoted by $f_{X_1, X_2, \ldots, X_k}(\cdot, \cdot, \ldots, \cdot)$, is defined 
to be 

$$f_{X_1, X_2, \ldots, X_k}(x_1, x_2, \ldots, x_k) = P[X_1 = x_1; X_2 = x_2; \ldots; X_k = x_k]$$

for $(x_1, x_2, \ldots, x_k)$, a value of $(X_1, X_2, \ldots, X_k)$, and is defined to be 0 
otherwise. //// 

Remark $\sum f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = 1$, where the summation is over all 
possible values of $(X_1, \ldots, X_k)$. //// 



EXAMPLE 2 Let X denote the number on the downturned face of the first 
tetrahedron and 7the larger of the downturned numbers in the experiment 
of tossing two tetrahedra. The values that (X, Y) can take on are (1, 1), 
(1,2), (1, 3), (1,4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), and (4,4); hence X and 
Y are jointly discrete. The joint discrete density function of X and Y 
is given in Fig. 4. 

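In tabular form it is given as

(x, y)          (1,1)  (1,2)  (1,3)  (1,4)  (2,2)  (2,3)  (2,4)  (3,3)  (3,4)  (4,4)
f_{X,Y}(x, y)   1/16   1/16   1/16   1/16   2/16   1/16   1/16   3/16   1/16   4/16

or in another tabular form as

 y = 4     1/16    1/16    1/16    4/16
 y = 3     1/16    1/16    3/16    0
 y = 2     1/16    2/16    0       0
 y = 1     1/16    0       0       0
           x = 1   x = 2   x = 3   x = 4

////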
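The density of Example 2 can be reproduced by enumerating the 16 equally likely outcomes of the two tosses. The following is a minimal sketch; the function name `joint_density` and the use of exact fractions are choices made here for illustration, not part of the text.

```python
from fractions import Fraction
from itertools import product

def joint_density():
    """Return {(x, y): probability} for X = first face, Y = larger face."""
    density = {}
    # Each of the 16 (first, second) outcomes has probability 1/16.
    for first, second in product(range(1, 5), repeat=2):
        point = (first, max(first, second))
        density[point] = density.get(point, Fraction(0)) + Fraction(1, 16)
    return density

f_xy = joint_density()
for point in sorted(f_xy):
    print(point, f_xy[point])
```

Points that never occur, such as (2, 1), are simply absent from the dictionary, which matches the convention that the density is 0 there.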


Theorem 1 If X and Y are jointly discrete random variables, then
knowledge of $F_{X,Y}(\cdot, \cdot)$ is equivalent to knowledge of $f_{X,Y}(\cdot, \cdot)$. Also,
the statement extends to k-dimensional discrete random variables.

proof Let $(x_1, y_1), (x_2, y_2), \ldots$ be the possible values of (X, Y).
If $f_{X,Y}(\cdot, \cdot)$ is given, then $F_{X,Y}(x, y) = \sum f_{X,Y}(x_i, y_i)$, where the
summation is over all $i$ for which $x_i \le x$ and $y_i \le y$. Conversely, if $F_{X,Y}(\cdot, \cdot)$ is
given, then for $(x_i, y_i)$, a possible value of (X, Y),

$$f_{X,Y}(x_i, y_i) = F_{X,Y}(x_i, y_i) - \lim_{0 < h \to 0} F_{X,Y}(x_i - h, y_i) - \lim_{0 < h \to 0} F_{X,Y}(x_i, y_i - h) + \lim_{0 < h \to 0} F_{X,Y}(x_i - h, y_i - h).$$

////

Definition 6 Marginal discrete density If X and Y are jointly discrete
random variables, then $f_X(\cdot)$ and $f_Y(\cdot)$ are called marginal discrete
density functions. More generally, let $X_{i_1}, \ldots, X_{i_m}$ be any subset of the
jointly discrete random variables $X_1, \ldots, X_k$; then $f_{X_{i_1}, \ldots, X_{i_m}}(x_{i_1}, \ldots, x_{i_m})$
is also called a marginal density. ////

Remark If $X_1, \ldots, X_k$ are jointly discrete random variables, then any
marginal discrete density can be found from the joint density, but not
conversely. For example, if X and Y are jointly discrete with values
$(x_1, y_1), (x_2, y_2), \ldots$, then

$$f_X(x_k) = \sum_{\{i:\, x_i = x_k\}} f_{X,Y}(x_i, y_i) \quad \text{and} \quad f_Y(y_k) = \sum_{\{i:\, y_i = y_k\}} f_{X,Y}(x_i, y_i).$$

////

Heretofore we have indexed the values of (X, Y) with a single index,
namely $i$. That is, we listed values as $(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots$. The
values of (X, Y) could also be indexed by using separate indices for the X and Y
values. For instance, we could let $i$ index the possible X values, say $x_1, \ldots, x_i, \ldots$,
and $j$ index the possible Y values, say $y_1, \ldots, y_j, \ldots$. Then the values
of (X, Y) would be a subset of the points $(x_i, y_j)$ for $i = 1, 2, \ldots$ and $j = 1, 2, \ldots$.
If this latter method of indexing is used, then the marginal density of X is
obtained as follows:

$$f_X(x_k) = \sum_j f_{X,Y}(x_k, y_j),$$

where the summation is over all $y_j$ for the fixed $x_k$. The marginal density of Y
is analogously obtained. The following example may help to clarify these two
different methods of indexing the values of (X, Y).

EXAMPLE 3 Return to the experiment of tossing two tetrahedra, and define
X as the number on the downturned face of the first tetrahedron and Y as
the larger of the numbers on the two downturned faces. The joint
density of X and Y is given in Fig. 4. The values of (X, Y) can be listed
as (1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), and (4, 4),
10 points in all. Or, if we note that X has values 1, 2, 3, and 4; Y has
values 1, 2, 3, and 4; and Y is greater than or equal to X, the values of
(X, Y) are $\{(i, j): i = 1, \ldots, 4;\ j = 1, \ldots, 4;\ \text{and } i \le j\}$. Let us use each
of these methods of indexing to evaluate $F_{X,Y}(2, 3)$ from the joint density.
Under the first method of indexing,

$$F_{X,Y}(2, 3) = \sum_{\{i:\, x_i \le 2,\ y_i \le 3\}} f_{X,Y}(x_i, y_i) = f_{X,Y}(1, 1) + f_{X,Y}(1, 2) + f_{X,Y}(1, 3) + f_{X,Y}(2, 2) + f_{X,Y}(2, 3) = \tfrac{6}{16}.$$

Under the second method of indexing,

$$F_{X,Y}(2, 3) = \sum_{i=1}^{2} \sum_{j=i}^{3} f_{X,Y}(i, j) = \tfrac{6}{16}.$$

Similarly, all other values of $F_{X,Y}(\cdot, \cdot)$ could be obtained. Also

$$f_Y(3) = \sum_{\{i:\, y_i = 3\}} f_{X,Y}(x_i, y_i) = f_{X,Y}(1, 3) + f_{X,Y}(2, 3) + f_{X,Y}(3, 3) = \tfrac{1}{16} + \tfrac{1}{16} + \tfrac{3}{16} = \tfrac{5}{16}.$$

Similarly $f_Y(1) = \tfrac{1}{16}$, $f_Y(2) = \tfrac{3}{16}$, and $f_Y(4) = \tfrac{7}{16}$, which together with
$f_Y(3) = \tfrac{5}{16}$ give the marginal discrete density function of Y. ////
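The two computations in Example 3 can be checked mechanically. The sketch below rebuilds the joint density of the tetrahedra experiment and then sums it as the example describes; the helper names `cdf` and `marginal_y` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Joint density of (X, Y): X = first face, Y = larger of the two faces.
f_xy = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    f_xy[key] = f_xy.get(key, Fraction(0)) + Fraction(1, 16)

def cdf(x, y):
    """F_{X,Y}(x, y): sum the density over points with x_i <= x and y_i <= y."""
    return sum(p for (xi, yi), p in f_xy.items() if xi <= x and yi <= y)

def marginal_y(y):
    """f_Y(y): sum the density over points whose second coordinate equals y."""
    return sum(p for (_, yi), p in f_xy.items() if yi == y)

print(cdf(2, 3), [marginal_y(y) for y in range(1, 5)])
```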


EXAMPLE 4 We mentioned that marginal densities can be obtained from 
the joint density, but not conversely. The following is an example of a 
family of joint densities that all have the same marginals, and hence we 
see that in general the joint density is not uniquely determined from 
knowledge of the marginals. Consider altering the joint density given 
in the previous examples as follows:

[Table of the altered joint density omitted.]

For each $0 < \varepsilon < \tfrac{1}{16}$, the above table defines a joint density. Note that
the marginal densities are independent of $\varepsilon$, and hence each of the joint
densities (there is a different joint density for each $0 < \varepsilon < \tfrac{1}{16}$) has the
same marginals. ////
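One way to build such a family is sketched below. It is an assumed construction for illustration and not necessarily the alteration used in the text's table: mass $\varepsilon$ is added at (1, 2) and (2, 3) and removed at (1, 3) and (2, 2), so every row sum and column sum is unchanged.

```python
from fractions import Fraction
from itertools import product

def base_density():
    """Joint density of X = first face, Y = larger face (two tetrahedra)."""
    f = {}
    for first, second in product(range(1, 5), repeat=2):
        key = (first, max(first, second))
        f[key] = f.get(key, Fraction(0)) + Fraction(1, 16)
    return f

def perturbed(eps):
    """Shift mass eps around four cells; valid for 0 <= eps <= 1/16."""
    f = base_density()
    f[(1, 2)] += eps
    f[(2, 3)] += eps
    f[(1, 3)] -= eps
    f[(2, 2)] -= eps
    return f

def marginals(f):
    """Return (f_X, f_Y) as dictionaries over the values 1..4."""
    fx = {x: sum(p for (xi, _), p in f.items() if xi == x) for x in range(1, 5)}
    fy = {y: sum(p for (_, yi), p in f.items() if yi == y) for y in range(1, 5)}
    return fx, fy

print(marginals(perturbed(Fraction(1, 32))) == marginals(base_density()))
```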

We saw that the binomial distribution was associated with independent, 
repeated Bernoulli trials; we shall see in the example below that the multinomial 
distribution is associated with independent, repeated trials that generalize from 
Bernoulli trials with two outcomes to more than two outcomes. 


EXAMPLE 5 Suppose that there are k + 1 (distinct) possible outcomes of a
trial. Denote these outcomes by $s_1, s_2, \ldots, s_{k+1}$, and let $p_i = P[s_i]$,
$i = 1, \ldots, k + 1$. Obviously we must have $\sum_{i=1}^{k+1} p_i = 1$, just as p + q = 1 in
the binomial case. Suppose that we repeat the trial n times. Let $X_i$
denote the number of times outcome $s_i$ occurs in the n trials,
$i = 1, \ldots, k + 1$. If the trials are repeated and independent, then the
discrete density function of the random variables $X_1, \ldots, X_k$ is

$$f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \frac{n!}{\prod_{i=1}^{k+1} x_i!} \prod_{i=1}^{k+1} p_i^{x_i}, \tag{1}$$

where $x_i = 0, \ldots, n$ and $\sum_{i=1}^{k+1} x_i = n$. Note that $X_{k+1} = n - \sum_{i=1}^{k} X_i$.

To justify Eq. (1), note that the left-hand side is $P[X_1 = x_1; X_2 = x_2;
\ldots; X_{k+1} = x_{k+1}]$; so, we want the probability that the n trials result in
exactly $x_1$ outcomes $s_1$, exactly $x_2$ outcomes $s_2$, ..., exactly $x_{k+1}$ outcomes
$s_{k+1}$, where $\sum_{i=1}^{k+1} x_i = n$. Any specific ordering of these n outcomes has
probability $p_1^{x_1} p_2^{x_2} \cdots p_{k+1}^{x_{k+1}}$ by the assumption of independent trials,
and there are $n!/(x_1!\, x_2! \cdots x_{k+1}!)$ such orderings. ////


Definition 7 Multinomial distribution The joint discrete density function
given in Eq. (1) is called the multinomial distribution. ////

The multinomial distribution is a (k + 1)-parameter family of distributions,
the parameters being n and $p_1, p_2, \ldots, p_k$. $p_{k+1}$ is, like q in the
binomial distribution, exactly determined by $p_{k+1} = 1 - p_1 - p_2 - \cdots - p_k$. A
particular case of a multinomial distribution is obtained by putting, for example,
n = 3, k = 2, $p_1 = .2$, and $p_2 = .3$, to get

$$f_{X_1, X_2}(x_1, x_2) = f(x_1, x_2) = \frac{3!}{x_1!\, x_2!\, (3 - x_1 - x_2)!} (.2)^{x_1} (.3)^{x_2} (.5)^{3 - x_1 - x_2}.$$

This density is plotted in Fig. 5.

We might observe that if $X_1, X_2, \ldots, X_k$ have the multinomial distribution
given in Eq. (1), then the marginal distribution of $X_i$ is a binomial distribution
with parameters n and $p_i$. This observation can be verified by recalling
the experiment of repeated, independent trials. Each trial can be thought of
as resulting either in outcome $s_i$ or not in outcome $s_i$, in which case the trial is
Bernoulli, implying that $X_i$ has a binomial distribution with parameters n and $p_i$.
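The binomial-marginal observation can be checked numerically for the particular case n = 3, $p_1 = .2$, $p_2 = .3$. The following sketch evaluates Eq. (1) with exact fractions; the function name `multinomial_pmf` and its argument layout (only the first k counts and probabilities are passed) are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb, factorial

def multinomial_pmf(xs, n, ps):
    """Eq. (1): xs and ps hold the first k counts and probabilities."""
    last_x = n - sum(xs)                 # x_{k+1} is determined by the others
    if last_x < 0 or any(x < 0 for x in xs):
        return Fraction(0)
    counts = list(xs) + [last_x]
    probs = list(ps) + [1 - sum(ps)]     # p_{k+1} = 1 - p_1 - ... - p_k
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)           # exact: partial quotients are integers
    value = Fraction(coeff)
    for x, p in zip(counts, probs):
        value *= p ** x
    return value

n, ps = 3, [Fraction(1, 5), Fraction(3, 10)]

def marginal_x1(x1):
    """Marginal of X_1: sum the joint density over all values of X_2."""
    return sum(multinomial_pmf((x1, x2), n, ps) for x2 in range(n + 1))

def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print([marginal_x1(x) == binomial_pmf(x, n, ps[0]) for x in range(n + 1)])
```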


2.3 Joint Density Functions for Continuous Random Variables 

Definition 8 Joint continuous random variables and density function The
k-dimensional random variable $(X_1, X_2, \ldots, X_k)$ is defined to be a
k-dimensional continuous random variable if and only if there exists a
function $f_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot) \ge 0$ such that

$$F_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \int_{-\infty}^{x_k} \cdots \int_{-\infty}^{x_1} f_{X_1, \ldots, X_k}(u_1, \ldots, u_k)\, du_1 \cdots du_k \tag{2}$$

for all $(x_1, \ldots, x_k)$. $f_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot)$ is defined to be a joint probability
density function. ////

As in the unidimensional case, a joint probability density function has
two properties:

(i) $f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) \ge 0$.

(ii) $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1, \ldots, X_k}(x_1, \ldots, x_k)\, dx_1 \cdots dx_k = 1$.

A unidimensional probability density function was used to find probabilities.
For example, for X a continuous random variable with probability
density $f_X(\cdot)$, $P[a < X < b] = \int_a^b f_X(x)\, dx$; that is, the area under $f_X(\cdot)$ over the
interval (a, b) gave $P[a < X < b]$; and, more generally, $P[X \in B] = \int_B f_X(x)\, dx$;
that is, the area under $f_X(\cdot)$ over the set B gave $P[X \in B]$. In the two-dimensional
case, volume gives probabilities. For instance, let $(X_1, X_2)$ be jointly
continuous random variables with joint probability density function
$f_{X_1, X_2}(x_1, x_2)$, and let R be some region in the $x_1 x_2$ plane; then

$$P[(X_1, X_2) \in R] = \iint_R f_{X_1, X_2}(x_1, x_2)\, dx_1\, dx_2;$$

that is, the probability that $(X_1, X_2)$ falls in the
region R is given by the volume under $f_{X_1, X_2}(\cdot, \cdot)$ over the region R. In particular,
if $R = \{(x_1, x_2): a_1 < x_1 < b_1;\ a_2 < x_2 < b_2\}$, then

$$P[a_1 < X_1 < b_1;\ a_2 < X_2 < b_2] = \int_{a_2}^{b_2} \left[ \int_{a_1}^{b_1} f_{X_1, X_2}(x_1, x_2)\, dx_1 \right] dx_2.$$

A joint probability density function is defined as any nonnegative integrand
satisfying Eq. (2) and hence is not uniquely defined.


EXAMPLE 6 Consider the bivariate function

$$f(x, y) = K(x + y) I_{(0,1)}(x) I_{(0,1)}(y) = K(x + y) I_U(x, y),$$

where $U = \{(x, y): 0 < x < 1 \text{ and } 0 < y < 1\}$, a unit square. Can the
constant K be selected so that f(x, y) will be a joint probability density
function? If K is positive, $f(x, y) \ge 0$.

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = \int_0^1 \int_0^1 K(x + y)\, dx\, dy = K \int_0^1 \int_0^1 (x + y)\, dx\, dy = K \int_0^1 \left(\tfrac{1}{2} + y\right) dy = K\left(\tfrac{1}{2} + \tfrac{1}{2}\right) = 1$$

for K = 1. So $f(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$ is a joint probability
density function. It is sketched in Fig. 6.

Probabilities of events defined in terms of the random variables can
be obtained by integrating the joint probability density function over the
indicated region; for example

$$P[0 < X < \tfrac{1}{2};\ 0 < Y < \tfrac{1}{4}] = \int_0^{1/4} \int_0^{1/2} (x + y)\, dx\, dy = \tfrac{1}{32} + \tfrac{1}{64} = \tfrac{3}{64},$$

which is the volume under the surface z = x + y over the region $\{(x, y):
0 < x < \tfrac{1}{2};\ 0 < y < \tfrac{1}{4}\}$ in the xy plane. ////
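Both integrals in Example 6 can be checked with a simple midpoint rule, which happens to be exact (up to rounding) for an integrand that is linear in each variable. This is only a numerical sketch; the function names and the grid size n = 200 are arbitrary choices made here.

```python
def midpoint_double_integral(f, x_range, y_range, n=200):
    """Approximate the double integral of f over a rectangle (midpoint rule)."""
    (ax, bx), (ay, by) = x_range, y_range
    hx, hy = (bx - ax) / n, (by - ay) / n
    total = 0.0
    for i in range(n):
        x = ax + (i + 0.5) * hx
        for j in range(n):
            y = ay + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy

def density(x, y):
    """f(x, y) = (x + y) on the open unit square, 0 elsewhere (K = 1)."""
    return x + y if 0 < x < 1 and 0 < y < 1 else 0.0

total_mass = midpoint_double_integral(density, (0, 1), (0, 1))
prob = midpoint_double_integral(density, (0, 0.5), (0, 0.25))
print(total_mass, prob, 3 / 64)
```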

Theorem 2 If X and Y are jointly continuous random variables, then
knowledge of $F_{X,Y}(\cdot, \cdot)$ is equivalent to knowledge of an $f_{X,Y}(\cdot, \cdot)$. The
remark extends to k-dimensional continuous random variables.

proof For a given $f_{X,Y}(\cdot, \cdot)$, $F_{X,Y}(x, y)$ is obtained for any
(x, y) by

$$F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\, du\, dv.$$

For given $F_{X,Y}(\cdot, \cdot)$, an $f_{X,Y}(x, y)$ can be obtained by

$$f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}$$

for x, y points, where $F_{X,Y}(x, y)$ is differentiable. ////

Definition 9 Marginal probability density functions If X and Y are
jointly continuous random variables, then $f_X(\cdot)$ and $f_Y(\cdot)$ are called
marginal probability density functions. More generally, let $X_{i_1}, \ldots, X_{i_m}$
be any subset of the jointly continuous random variables $X_1, \ldots, X_k$.
$f_{X_{i_1}, \ldots, X_{i_m}}(x_{i_1}, \ldots, x_{i_m})$ is called a marginal density of the m-dimensional
random variable $(X_{i_1}, \ldots, X_{i_m})$. ////

Remark If $X_1, \ldots, X_k$ are jointly continuous random variables, then
any marginal probability density function can be found. (However,
knowledge of all marginal densities does not, in general, imply knowledge
of the joint density, as Example 8 below shows.) If X and Y are jointly
continuous, then

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx \tag{3}$$

since

$$f_X(x) = \frac{dF_X(x)}{dx} = \frac{d}{dx} \left[ \int_{-\infty}^{x} \left( \int_{-\infty}^{\infty} f_{X,Y}(u, y)\, dy \right) du \right] = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy.$$

////


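EXAMPLE 7 Consider the joint probability density

$$f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y).$$

$$\begin{aligned}
F_{X,Y}(x, y) ={}& I_{(0,1)}(x) I_{(0,1)}(y) \int_0^y \int_0^x (u + v)\, du\, dv \\
&+ I_{(0,1)}(x) I_{[1,\infty)}(y) \int_0^1 \int_0^x (u + v)\, du\, dv \\
&+ I_{[1,\infty)}(x) I_{(0,1)}(y) \int_0^y \int_0^1 (u + v)\, du\, dv \\
&+ I_{[1,\infty)}(x) I_{[1,\infty)}(y) \\
={}& \tfrac{1}{2}\left\{ (x^2 y + x y^2) I_{(0,1)}(x) I_{(0,1)}(y) + (x^2 + x) I_{(0,1)}(x) I_{[1,\infty)}(y) + (y + y^2) I_{[1,\infty)}(x) I_{(0,1)}(y) \right\} \\
&+ I_{[1,\infty)}(x) I_{[1,\infty)}(y).
\end{aligned}$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy = I_{(0,1)}(x) \int_0^1 (x + y)\, dy = \left(x + \tfrac{1}{2}\right) I_{(0,1)}(x);$$

or,

$$f_X(x) = \frac{\partial F_{X,Y}(x, \infty)}{\partial x} = \frac{dF_X(x)}{dx} = I_{(0,1)}(x) \frac{d}{dx}\left(\frac{x^2 + x}{2}\right) = \left(x + \tfrac{1}{2}\right) I_{(0,1)}(x).$$

////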


EXAMPLE 8 Let $f_X(x)$ and $f_Y(y)$ be two probability density functions with
corresponding cumulative distribution functions $F_X(x)$ and $F_Y(y)$, respectively.
For $-1 \le \alpha \le 1$, define

$$f_{X,Y}(x, y; \alpha) = f_X(x) f_Y(y) \{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\}. \tag{4}$$

We will show (i) that for each $\alpha$ satisfying $-1 \le \alpha \le 1$, $f_{X,Y}(x, y; \alpha)$ is a
joint probability density function and (ii) that the marginals of $f_{X,Y}(x, y; \alpha)$
are $f_X(x)$ and $f_Y(y)$, respectively. Thus, $\{f_{X,Y}(x, y; \alpha): -1 \le \alpha \le 1\}$ will be
an infinite family of joint probability density functions, each having the
same two given marginals. To verify (i) we must show that $f_{X,Y}(x, y; \alpha)$
is nonnegative and, if integrated over the xy plane, integrates to 1.

$$f_X(x) f_Y(y) \{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\} \ge 0$$

if $1 \ge -\alpha[2F_X(x) - 1][2F_Y(y) - 1]$;
but $\alpha$, $2F_X(x) - 1$, and $2F_Y(y) - 1$ are all between $-1$ and 1, and hence
also their product, which implies $f_{X,Y}(x, y; \alpha)$ is nonnegative. Since

$$1 = \int_{-\infty}^{\infty} f_X(x)\, dx = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} f_{X,Y}(x, y; \alpha)\, dy \right) dx,$$

it suffices to show that $f_X(x)$ and $f_Y(y)$ are the marginals of $f_{X,Y}(x, y; \alpha)$.

$$\begin{aligned}
\int_{-\infty}^{\infty} f_{X,Y}(x, y; \alpha)\, dy &= \int_{-\infty}^{\infty} f_X(x) f_Y(y) \{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\}\, dy \\
&= f_X(x) \int_{-\infty}^{\infty} f_Y(y)\, dy + \alpha f_X(x)[2F_X(x) - 1] \int_{-\infty}^{\infty} [2F_Y(y) - 1] f_Y(y)\, dy \\
&= f_X(x),
\end{aligned}$$

noting that

$$\int_{-\infty}^{\infty} [2F_Y(y) - 1] f_Y(y)\, dy = \int_0^1 (2u - 1)\, du = 0$$

by making the transformation $u = F_Y(y)$. ////
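To see Eq. (4) at work, the sketch below takes both marginals to be uniform on (0, 1), an assumption made here purely for illustration, so that $f_X = f_Y = 1$ and $F_X(x) = x$, $F_Y(y) = y$ on the unit interval. The marginal of X is then recovered numerically for several values of $\alpha$.

```python
def joint(x, y, alpha):
    """Eq. (4) with uniform(0, 1) marginals (an illustrative assumption)."""
    if not (0 < x < 1 and 0 < y < 1):
        return 0.0
    return 1.0 + alpha * (2 * x - 1) * (2 * y - 1)

def marginal_x(x, alpha, n=1000):
    """Integrate the joint density over y in (0, 1) with the midpoint rule."""
    h = 1.0 / n
    return sum(joint(x, (j + 0.5) * h, alpha) for j in range(n)) * h

for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(alpha, marginal_x(0.3, alpha), joint(0.9, 0.9, alpha))
```

The marginal stays at 1 for every $\alpha$, while the joint density at a fixed point such as (0.9, 0.9) changes with $\alpha$.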


3 CONDITIONAL DISTRIBUTIONS AND 
STOCHASTIC INDEPENDENCE 

In the preceding section we defined the joint distribution and joint density 
functions of several random variables; in this section we define conditional 
distributions and the related concept of stochastic independence. Most definitions
will be given first for only two random variables and later extended to k
random variables.


3.1 Conditional Distribution Functions 
for Discrete Random Variables 


Definition 10 Conditional discrete density function Let X and Y be
jointly discrete random variables with joint discrete density function
$f_{X,Y}(\cdot, \cdot)$. The conditional discrete density function of Y given X = x,
denoted by $f_{Y|X}(\cdot \mid x)$, is defined to be

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \tag{5}$$

if $f_X(x) > 0$, where $f_X(x)$ is the marginal density of X evaluated at x.
$f_{Y|X}(\cdot \mid x)$ is undefined for $f_X(x) = 0$. Similarly,

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \tag{6}$$

if $f_Y(y) > 0$. ////

Since X and Y are discrete, they have mass points, say $x_1, x_2, \ldots$ for X
and $y_1, y_2, \ldots$ for Y. If $f_X(x) > 0$, then $x = x_i$ for some $i$, and $f_X(x_i)
= P[X = x_i]$. The numerator of the right-hand side of Eq. (5) is $f_{X,Y}(x_i, y_j)
= P[X = x_i; Y = y_j]$; so

$$f_{Y|X}(y_j \mid x_i) = \frac{f_{X,Y}(x_i, y_j)}{f_X(x_i)} = \frac{P[X = x_i; Y = y_j]}{P[X = x_i]} = P[Y = y_j \mid X = x_i],$$

for $y_j$ a mass point of Y and $x_i$ a mass point of X; hence $f_{Y|X}(\cdot \mid x)$ is a conditional
probability as defined in Subsec. 3.6 of Chap. I. $f_{Y|X}(\cdot \mid x)$ is called a
conditional discrete density function and hence should possess the properties
of a discrete density function. To see that it does, consider x as some fixed
mass point of X. Then $f_{Y|X}(y \mid x)$ is a function with argument y; and to be a
discrete density function must be nonnegative and, if summed over the possible
values (mass points) of Y, must sum to 1. $f_{Y|X}(y \mid x)$ is nonnegative since
$f_{X,Y}(x, y)$ is nonnegative and $f_X(x)$ is positive.

$$\sum_j f_{Y|X}(y_j \mid x) = \sum_j \frac{f_{X,Y}(x, y_j)}{f_X(x)} = \frac{1}{f_X(x)} \sum_j f_{X,Y}(x, y_j) = \frac{f_X(x)}{f_X(x)} = 1,$$

where the summation is over all the mass points of Y. (We used the fact that
the marginal discrete density of X is obtained by summing the joint density of
X and Y over the possible values of Y.) So $f_{Y|X}(\cdot \mid x)$ is indeed a density; it
tells us how the values of Y are distributed for a given value x of X.

The conditional cumulative distribution of Y given X = x can be defined 
for two jointly discrete random variables by recalling the close relationship 
between discrete density functions and cumulative distribution functions. 


Definition 11 Conditional discrete cumulative distribution If X and Y
are jointly discrete random variables, the conditional cumulative distribution
of Y given X = x, denoted by $F_{Y|X}(\cdot \mid x)$, is defined to be $F_{Y|X}(y \mid x) =
P[Y \le y \mid X = x]$ for $f_X(x) > 0$. ////

Remark $F_{Y|X}(y \mid x) = \sum_{\{j:\, y_j \le y\}} f_{Y|X}(y_j \mid x)$. ////


EXAMPLE 9 Return to the experiment of tossing two tetrahedra. Let X
denote the number on the downturned face of the first and Y the larger
of the downturned numbers. What is the density of Y given that X = 2?

$$f_{Y|X}(2 \mid 2) = \frac{f_{X,Y}(2, 2)}{f_X(2)} = \frac{2/16}{4/16} = \frac{1}{2},$$

$$f_{Y|X}(3 \mid 2) = \frac{f_{X,Y}(2, 3)}{f_X(2)} = \frac{1/16}{4/16} = \frac{1}{4},$$

$$f_{Y|X}(4 \mid 2) = \frac{f_{X,Y}(2, 4)}{f_X(2)} = \frac{1/16}{4/16} = \frac{1}{4}.$$

Also,

$$f_{Y|X}(y \mid 3) = \begin{cases} \tfrac{3}{4} & \text{for } y = 3 \\ \tfrac{1}{4} & \text{for } y = 4. \end{cases}$$

////
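The conditional densities of Example 9 follow directly from Eq. (5) once the joint density is tabulated. A minimal sketch, with the illustrative helper name `conditional_y_given_x`:

```python
from fractions import Fraction
from itertools import product

# Joint density of (X, Y): X = first face, Y = larger of the two faces.
f_xy = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    f_xy[key] = f_xy.get(key, Fraction(0)) + Fraction(1, 16)

def conditional_y_given_x(x):
    """Return {y: f_{Y|X}(y | x)} by dividing the joint density by f_X(x)."""
    fx = sum(p for (xi, _), p in f_xy.items() if xi == x)  # marginal f_X(x)
    if fx == 0:
        raise ValueError("conditional density is undefined when f_X(x) = 0")
    return {yi: p / fx for (xi, yi), p in f_xy.items() if xi == x}

print(conditional_y_given_x(2))
print(conditional_y_given_x(3))
```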


Definition 12 Conditional discrete density function Let $(X_1, \ldots, X_k)$ be
a k-dimensional discrete random variable, and let $X_{i_1}, \ldots, X_{i_r}$ and
$X_{j_1}, \ldots, X_{j_s}$ be two disjoint subsets of the random variables $X_1, \ldots, X_k$.
The conditional density of the r-dimensional random variable $(X_{i_1}, \ldots, X_{i_r})$
given the value $(x_{j_1}, \ldots, x_{j_s})$ of $(X_{j_1}, \ldots, X_{j_s})$ is defined to be

$$f_{X_{i_1}, \ldots, X_{i_r} \mid X_{j_1}, \ldots, X_{j_s}}(x_{i_1}, \ldots, x_{i_r} \mid x_{j_1}, \ldots, x_{j_s}) = \frac{f_{X_{i_1}, \ldots, X_{i_r}, X_{j_1}, \ldots, X_{j_s}}(x_{i_1}, \ldots, x_{i_r}, x_{j_1}, \ldots, x_{j_s})}{f_{X_{j_1}, \ldots, X_{j_s}}(x_{j_1}, \ldots, x_{j_s})}.$$

////


EXAMPLE 10 Let $X_1, \ldots, X_5$ be jointly discrete random variables. Take
r = s = 2, $(X_{i_1}, X_{i_2}) = (X_1, X_2)$, and $(X_{j_1}, X_{j_2}) = (X_3, X_5)$; then

$$f_{X_1, X_2 \mid X_3, X_5}(x_1, x_2 \mid x_3, x_5) = \frac{f_{X_1, X_2, X_3, X_5}(x_1, x_2, x_3, x_5)}{f_{X_3, X_5}(x_3, x_5)}.$$

////


EXAMPLE 11 Suppose 12 cards are drawn without replacement from an
ordinary deck of playing cards. Let $X_1$ be the number of aces drawn,
$X_2$ be the number of 2s, $X_3$ be the number of 3s, and $X_4$ be the number
of 4s. The joint density of these four random variables is given by

$$f_{X_1, X_2, X_3, X_4}(x_1, x_2, x_3, x_4) = \frac{\binom{4}{x_1}\binom{4}{x_2}\binom{4}{x_3}\binom{4}{x_4}\binom{36}{12 - x_1 - x_2 - x_3 - x_4}}{\binom{52}{12}},$$

where $x_i = 0, 1, 2, 3,$ or 4 and $i = 1, \ldots, 4$, subject to the restriction that
$\sum x_i \le 12$. There are a large number of conditional densities associated
with this density; an example is

$$f_{X_2, X_4 \mid X_1, X_3}(x_2, x_4 \mid x_1, x_3) = \frac{\binom{4}{x_1}\binom{4}{x_2}\binom{4}{x_3}\binom{4}{x_4}\binom{36}{12 - x_1 - x_2 - x_3 - x_4} \Big/ \binom{52}{12}}{\binom{4}{x_1}\binom{4}{x_3}\binom{44}{12 - x_1 - x_3} \Big/ \binom{52}{12}} = \frac{\binom{4}{x_2}\binom{4}{x_4}\binom{36}{12 - x_1 - x_2 - x_3 - x_4}}{\binom{44}{12 - x_1 - x_3}},$$

where $x_i = 0, 1, \ldots, 4$ and $x_2 + x_4 \le 12 - x_1 - x_3$. ////
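The simplification in Example 11 can be confirmed by brute force: compute the conditional density as joint over marginal, and compare it with the closed form. The sketch below uses `math.comb`; the function names and the particular conditioning values $x_1 = 1$, $x_3 = 2$ in the usage line are arbitrary.

```python
from fractions import Fraction
from math import comb

DRAWN = 12  # number of cards drawn from the 52-card deck

def joint(x1, x2, x3, x4):
    """Joint density of the numbers of aces, 2s, 3s, and 4s drawn."""
    rest = DRAWN - x1 - x2 - x3 - x4   # cards drawn from the other 36
    if rest < 0:
        return Fraction(0)
    ways = comb(4, x1) * comb(4, x2) * comb(4, x3) * comb(4, x4) * comb(36, rest)
    return Fraction(ways, comb(52, DRAWN))

def conditional(x2, x4, x1, x3):
    """f_{X2,X4 | X1,X3}: joint density divided by the marginal of (X1, X3)."""
    marginal = sum(joint(x1, a, x3, b) for a in range(5) for b in range(5))
    return joint(x1, x2, x3, x4) / marginal

def closed_form(x2, x4, x1, x3):
    """The simplified expression at the end of Example 11."""
    rest = DRAWN - x1 - x2 - x3 - x4
    if rest < 0:
        return Fraction(0)
    return Fraction(comb(4, x2) * comb(4, x4) * comb(36, rest),
                    comb(44, DRAWN - x1 - x3))

print(conditional(1, 0, 1, 2), closed_form(1, 0, 1, 2))
```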


3.2 Conditional Distribution Functions 
for Continuous Random Variables 


Definition 13 Conditional probability density function Let X and Y
be jointly continuous random variables with joint probability density
function $f_{X,Y}(x, y)$. The conditional probability density function of Y
given X = x, denoted by $f_{Y|X}(\cdot \mid x)$, is defined to be

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \tag{7}$$

if $f_X(x) > 0$, where $f_X(x)$ is the marginal probability density of X, and is
undefined at points where $f_X(x) = 0$.

Similarly,

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{if } f_Y(y) > 0, \tag{8}$$

and is undefined if $f_Y(y) = 0$. ////


$f_{Y|X}(\cdot \mid x)$ is called a (conditional) probability density function and hence
should possess the properties of a probability density function. $f_{Y|X}(\cdot \mid x)$ is
clearly nonnegative, and

$$\int_{-\infty}^{\infty} f_{Y|X}(y \mid x)\, dy = \int_{-\infty}^{\infty} \frac{f_{X,Y}(x, y)}{f_X(x)}\, dy = \frac{1}{f_X(x)} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy = \frac{f_X(x)}{f_X(x)} = 1.$$

The density $f_{Y|X}(\cdot \mid x)$ is a density of the random variable Y given that x
is the value of the random variable X. In the conditional density $f_{Y|X}(\cdot \mid x)$,
x is fixed and could be thought of as a parameter. Consider $f_{Y|X}(\cdot \mid x_0)$, that is,
the density of Y given that X was observed to be $x_0$. Now $f_{X,Y}(x, y)$ plots as a
surface over the xy plane. A plane perpendicular to the xy plane which intersects
the xy plane on the line $x = x_0$ will intersect the surface in the curve
$f_{X,Y}(x_0, y)$. The area under this curve is

$$\int_{-\infty}^{\infty} f_{X,Y}(x_0, y)\, dy = f_X(x_0).$$

Hence, if we divide $f_{X,Y}(x_0, y)$ by $f_X(x_0)$, we obtain a density which is precisely
$f_{Y|X}(y \mid x_0)$.

Again, the conditional cumulative distribution can be defined in the 
natural way. 

Definition 14 Conditional continuous cumulative distribution If X and
Y are jointly continuous, then the conditional cumulative distribution of Y
given X = x is defined as

$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} f_{Y|X}(z \mid x)\, dz$$

for all x such that $f_X(x) > 0$. ////


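EXAMPLE 12 Suppose $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

$$f_{Y|X}(y \mid x) = \frac{(x + y) I_{(0,1)}(x) I_{(0,1)}(y)}{(x + \tfrac{1}{2}) I_{(0,1)}(x)} = \frac{x + y}{x + \tfrac{1}{2}} I_{(0,1)}(y)$$

for 0 < x < 1. Note that

$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} f_{Y|X}(z \mid x)\, dz = \int_0^y \frac{x + z}{x + \tfrac{1}{2}}\, dz = \frac{1}{x + \tfrac{1}{2}} \int_0^y (x + z)\, dz = \frac{1}{x + \tfrac{1}{2}} \left(xy + \frac{y^2}{2}\right) \quad \text{for } 0 < y < 1.$$

////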


Conditional probability density functions can be analogously defined for
k-dimensional continuous random variables. For instance,

$$f_{X_1, X_2, X_4 \mid X_3, X_5}(x_1, x_2, x_4 \mid x_3, x_5) = \frac{f_{X_1, X_2, X_3, X_4, X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_3, X_5}(x_3, x_5)}$$

for $f_{X_3, X_5}(x_3, x_5) > 0$.


3.3 More on Conditional Distribution Functions

We have defined the conditional cumulative distribution $F_{Y|X}(y \mid x)$ for either
jointly continuous or jointly discrete random variables. If X is discrete and Y
is any random variable, then $F_{Y|X}(y \mid x)$ can be defined as $P[Y \le y \mid X = x]$ if x
is a mass point of X. We would like to define $P[Y \le y \mid X = x]$ and more
generally $P[A \mid X = x]$, where A is any event, for X either a discrete or continuous
random variable. Thus we seek to define the conditional probability of an
event A given a random variable X = x.

We start by assuming that the event A and the random variable X are
both defined on the same probability space. We want to define $P[A \mid X = x]$.
If X is discrete, either x is a mass point of X, or it is not; and if x is a mass point
of X,

$$P[A \mid X = x] = \frac{P[A; X = x]}{P[X = x]},$$

which is well defined; on the other hand, if x is not a mass point of X, we are
not interested in $P[A \mid X = x]$. Now if X is continuous, $P[A \mid X = x]$ cannot be
analogously defined since $P[X = x] = 0$; however, if x is such that the events
$\{x - h < X < x + h\}$ have positive probability for every h > 0, then $P[A \mid X = x]$
could be defined as

$$P[A \mid X = x] = \lim_{0 < h \to 0} P[A \mid x - h < X < x + h] \tag{9}$$

provided that the limit exists. We will take Eq. (9) as our definition of
$P[A \mid X = x]$ if the indicated limit exists, and leave $P[A \mid X = x]$ undefined otherwise.
(It is, in fact, possible to give $P[A \mid X = x]$ meaning even if $P[X = x] = 0$,
and such is done in advanced probability theory.)

We will seldom be interested in $P[A \mid X = x]$ per se, but will be interested
in using it to calculate certain probabilities. We note the following formulas:

(i) $P[A] = \sum_{i=1}^{\infty} P[A \mid X = x_i] f_X(x_i)$ (10)

if X is discrete with mass points $x_1, x_2, \ldots$.

(ii) $P[A] = \int_{-\infty}^{\infty} P[A \mid X = x] f_X(x)\, dx$ (11)

if X is continuous.

(iii) $P[A; X \in B] = \sum_{\{i:\, x_i \in B\}} P[A \mid X = x_i] f_X(x_i)$ (12)

if X is discrete with mass points $x_1, x_2, \ldots$.

(iv) $P[A; X \in B] = \int_B P[A \mid X = x] f_X(x)\, dx$ (13)

if X is continuous.

Although we will not prove the above formulas, we note that Eq. (10) is
just the theorem of total probabilities given in Subsec. 3.6 of Chap. I and the
others are generalizations of the same. Some problems are of such a nature
that it is easy to find $P[A \mid X = x]$ and difficult to find $P[A]$. If, however, $f_X(\cdot)$
is known, then $P[A]$ can be easily obtained using the appropriate one of the
above formulas.

Remark $F_{X,Y}(x, y) = \int_{-\infty}^{x} F_{Y|X}(y \mid x') f_X(x')\, dx'$ results from Eq. (13) by
taking $A = \{Y \le y\}$ and $B = (-\infty, x]$; and $F_Y(y) = \int_{-\infty}^{\infty} F_{Y|X}(y \mid x) f_X(x)\, dx$
is obtained from Eq. (11) by taking $A = \{Y \le y\}$. ////

We add one other formula, whose proof is also omitted. Suppose
$A = \{h(X, Y) \le z\}$, where $h(\cdot, \cdot)$ is some function of two variables; then

(v) $P[A \mid X = x] = P[h(X, Y) \le z \mid X = x] = P[h(x, Y) \le z \mid X = x]$. (14)

The following is a classical example that uses Eq. (11); another example 
utilizing Eqs. (14) and (11) appears at the end of the next subsection. 


EXAMPLE 13 Three points are selected randomly on the circumference of
a circle. What is the probability that there will be a semicircle on which
all three points will lie? By selecting a point "randomly," we mean that
the point is equally likely to be any point on the circumference of the
circle; that is, the point is uniformly distributed over the circumference
of the circle. Let us use the first point to orient the circle; for example,
orient the circle (assumed centered at the origin) so that the first point
falls on the positive x axis. Let X denote the position of the second point,
and let A denote the event that all three points lie on the same half circle.
X is uniformly distributed over the interval $(0, 2\pi)$. According to Eq. (11),
$P[A] = \int P[A \mid X = x] f_X(x)\, dx$. Note that for $0 < x < \pi$, $P[A \mid X = x] =
(\pi - x + \pi)/2\pi$ since, given X = x, event A occurs if and only if the
third point falls between $x - \pi$ and $\pi$. Similarly, $P[A \mid X = x] =
(x + \pi - \pi)/2\pi$ for $\pi \le x < 2\pi$. Hence

$$P[A] = \int_0^{2\pi} P[A \mid X = x] \frac{1}{2\pi}\, dx = \frac{1}{2\pi} \left\{ \int_0^{\pi} \frac{2\pi - x}{2\pi}\, dx + \int_{\pi}^{2\pi} \frac{x}{2\pi}\, dx \right\} = \frac{3}{4}.$$

////
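The answer to Example 13 can be sanity-checked by simulation. The sketch below is a Monte Carlo estimate rather than the conditioning argument of the example: it uses the fact that points on a circle fit in a half circle exactly when some gap between neighboring points is at least $\pi$. The function names, the number of trials, and the seed are arbitrary choices made here.

```python
import math
import random

def on_common_semicircle(angles):
    """True if all points (angles in [0, 2*pi)) fit on some half circle."""
    ordered = sorted(angles)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(2 * math.pi - ordered[-1] + ordered[0])  # wrap-around gap
    # The points fit on a half circle iff they leave an empty arc of length >= pi.
    return max(gaps) >= math.pi

def estimate(trials=100_000, seed=0):
    """Estimate the probability that three uniform points share a half circle."""
    rng = random.Random(seed)
    hits = sum(
        on_common_semicircle([rng.uniform(0, 2 * math.pi) for _ in range(3)])
        for _ in range(trials)
    )
    return hits / trials

print(estimate())
```

With 100,000 trials the standard error is roughly 0.0014, so the estimate should land close to the exact value 3/4.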
3.4 Independence

When we defined the conditional probability of two events in Chap. I, we also
defined independence of events. We have now defined the conditional distribution
of random variables; so we should define independence of random
variables as well.

Definition 15 Stochastic independence Let $(X_1, X_2, \ldots, X_k)$ be a
k-dimensional random variable. $X_1, X_2, \ldots, X_k$ are defined to be
stochastically independent if and only if

$$F_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} F_{X_i}(x_i) \tag{15}$$

for all $x_1, x_2, \ldots, x_k$. ////

Definition 16 Stochastic independence Let $(X_1, X_2, \ldots, X_k)$ be a
k-dimensional discrete random variable with joint discrete density function
$f_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot)$. $X_1, \ldots, X_k$ are stochastically independent if and
only if

$$f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} f_{X_i}(x_i) \tag{16}$$

for all values $(x_1, \ldots, x_k)$ of $(X_1, \ldots, X_k)$. ////


Definition 17 Stochastic independence Let $(X_1, \ldots, X_k)$ be a k-dimensional
continuous random variable with joint probability density function
$f_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot)$. $X_1, \ldots, X_k$ are stochastically independent if and only
if

$$f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} f_{X_i}(x_i) \tag{17}$$

for all $x_1, \ldots, x_k$. ////

Remark Often the word “stochastically” will be omitted. //// 

We saw that independence of events was closely related to conditional
probability; likewise independence of random variables is closely related to
conditional distributions of random variables. For example, suppose X and
Y are two independent random variables; then $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ by definition
of independence; however, $f_{X,Y}(x, y) = f_{Y|X}(y \mid x) f_X(x)$ by definition of
conditional density, which implies that $f_{Y|X}(y \mid x) = f_Y(y)$; that is, the conditional
density of Y given x is the unconditional density of Y. So to show that two
random variables are not independent, it suffices to show that $f_{Y|X}(y \mid x)$ depends
on x.


EXAMPLE 14 Let X be the number on the downturned face of the first
tetrahedron and Y the larger of the two downturned numbers in the experiment
of tossing two tetrahedra. Are X and Y independent? Obviously
not, since $f_{Y|X}(2 \mid 3) = P[Y = 2 \mid X = 3] = 0 \ne f_Y(2) = P[Y = 2] = \tfrac{3}{16}$.

////
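For a finite joint density, the product condition of Eq. (16) can be tested cell by cell. The sketch below does this for the pair of Example 14 and, for contrast, for the two faces themselves; the second pair is not discussed in the text and is included here only as an illustration of a case where the check succeeds.

```python
from fractions import Fraction
from itertools import product

def is_independent(f_xy):
    """Check f_{X,Y}(x, y) = f_X(x) f_Y(y) at every pair of mass points."""
    xs = {x for x, _ in f_xy}
    ys = {y for _, y in f_xy}
    fx = {x: sum(p for (xi, _), p in f_xy.items() if xi == x) for x in xs}
    fy = {y: sum(p for (_, yi), p in f_xy.items() if yi == y) for y in ys}
    return all(f_xy.get((x, y), 0) == fx[x] * fy[y] for x in xs for y in ys)

# (X, Y) of Example 14: first face and larger of the two faces.
first_and_max = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    first_and_max[key] = first_and_max.get(key, Fraction(0)) + Fraction(1, 16)

# For contrast: the two faces themselves, each pair having probability 1/16.
first_and_second = {pair: Fraction(1, 16) for pair in product(range(1, 5), repeat=2)}

print(is_independent(first_and_max), is_independent(first_and_second))
```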

EXAMPLE 15 Let $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$. Are X
and Y independent? No, since $f_{Y|X}(y \mid x) = [(x + y)/(x + \tfrac{1}{2})] I_{(0,1)}(y)$ for
0 < x < 1, $f_{Y|X}(y \mid x)$ depends on x and hence cannot equal $f_Y(y)$. ////

EXAMPLE 16 Let $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$. X and Y are
independent since

$$f_{X,Y}(x, y) = [e^{-x} I_{(0,\infty)}(x)][e^{-y} I_{(0,\infty)}(y)] = f_X(x) f_Y(y)$$

for all (x, y). ////

It can be proved that if $X_1, \ldots, X_k$ are jointly continuous random variables,
then Definitions 15 and 17 are equivalent. Similarly, for jointly discrete
random variables, Definitions 15 and 16 are equivalent. It can also be proved
that Eq. (15) is equivalent to $P[X_1 \in B_1; \ldots; X_k \in B_k] = \prod_{i=1}^{k} P[X_i \in B_i]$ for sets
$B_1, \ldots, B_k$. The following important result is easily derived using the above
equivalent notions of independence.

Theorem 3 If $X_1, \ldots, X_k$ are independent random variables and
$g_1(\cdot), \ldots, g_k(\cdot)$ are k functions such that $Y_j = g_j(X_j)$, $j = 1, \ldots, k$, are
random variables, then $Y_1, \ldots, Y_k$ are independent.

proof Note that if $g_j^{-1}(B_j) = \{z: g_j(z) \in B_j\}$, then the events
$\{Y_j \in B_j\}$ and $\{X_j \in g_j^{-1}(B_j)\}$ are equivalent; consequently,

$$P[Y_1 \in B_1; \ldots; Y_k \in B_k] = P[X_1 \in g_1^{-1}(B_1); \ldots; X_k \in g_k^{-1}(B_k)] = \prod_{j=1}^{k} P[X_j \in g_j^{-1}(B_j)] = \prod_{j=1}^{k} P[Y_j \in B_j].$$

////

For k = 2, the above theorem states that if two random variables, say
X and Y, are independent, then a function of X is independent of a function of
Y. Such a result is certainly intuitively plausible.

We will return to independence of random variables in Subsec. 4.5.

Equation (14) of the previous subsection states that $P[h(X, Y) \le z \mid X = x]
= P[h(x, Y) \le z \mid X = x]$. Now if X and Y are assumed to be independent,
then $P[h(x, Y) \le z \mid X = x] = P[h(x, Y) \le z]$, which is a probability that may
be easy to calculate for certain problems.


EXAMPLE 17 Let a random variable Y represent the diameter of a shaft
and a random variable X represent the inside diameter of the housing
that is intended to support the shaft. By design the shaft is to have
diameter 99.5 units and the housing inside diameter 100 units. If the
manufacturing process of each of the items is imperfect, so that in fact Y
is uniformly distributed over the interval (98.5, 100.5) and X is uniformly
distributed over (99, 101), what is the probability that a particular shaft
can be successfully paired with a particular housing, when "successfully
paired" is taken to mean that X - h < Y < X for some small positive
quantity h? Assume that X and Y are independent; then

$$P[X - h < Y < X] = \int_{-\infty}^{\infty} P[X - h < Y < X \mid X = x] f_X(x)\, dx = \int_{99}^{101} P[x - h < Y < x] \tfrac{1}{2}\, dx.$$

Suppose now that h = 1; then

$$P[x - 1 < Y < x] = \begin{cases} \dfrac{x - 98.5}{2} & \text{for } 99 < x < 99.5 \\[4pt] \dfrac{1}{2} & \text{for } 99.5 \le x < 100.5 \\[4pt] \dfrac{100.5 - (x - 1)}{2} & \text{for } 100.5 \le x < 101. \end{cases}$$

Hence,

$$\begin{aligned}
P[X - 1 < Y < X] &= \int_{99}^{101} P[x - 1 < Y < x] \tfrac{1}{2}\, dx \\
&= \int_{99}^{99.5} \tfrac{1}{2}(x - 98.5) \tfrac{1}{2}\, dx + \int_{99.5}^{100.5} \tfrac{1}{2} \cdot \tfrac{1}{2}\, dx + \int_{100.5}^{101} \tfrac{1}{2}(100.5 - x + 1) \tfrac{1}{2}\, dx = \tfrac{7}{16}.
\end{aligned}$$

////
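The value 7/16 in Example 17 can be reproduced by carrying out the integral of Eq. (11) numerically. In the sketch below, $P[x - h < Y < x]$ is written as the length of the overlap of the intervals $(x - h, x)$ and (98.5, 100.5) divided by 2, which agrees with the three cases listed in the example for h = 1. The function names and the grid size are choices made for this illustration.

```python
def prob_fit_given_x(x, h=1.0, low=98.5, high=100.5):
    """P[x - h < Y < x] for Y uniform on (low, high)."""
    overlap = min(x, high) - max(x - h, low)
    return max(overlap, 0.0) / (high - low)

def prob_successful_pairing(h=1.0, n=2000):
    """Midpoint-rule evaluation of Eq. (11) with X uniform on (99, 101)."""
    a, b = 99.0, 101.0
    step = (b - a) / n
    total = sum(prob_fit_given_x(a + (i + 0.5) * step, h) for i in range(n))
    return total * step / (b - a)  # f_X(x) = 1 / (b - a) on (a, b)

print(prob_successful_pairing(), 7 / 16)
```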
4 EXPECTATION

When we introduced the concept of expectation for univariate random variables
in Sec. 4 of Chap. II, we first defined the mean and variance as particular expectations
and then defined the expectation of a general function of a random variable.
Here, we will commence, in Subsec. 4.1, with the definition of the
expectation of a general function of a k-dimensional random variable. The
definition will be given for only those k-dimensional random variables which
have densities.


4.1 Definition 

Definition 18 Expectation Let $(X_1, \ldots, X_k)$ be a k-dimensional random
variable with density $f_{X_1, \ldots, X_k}(\cdot, \ldots, \cdot)$. The expected value of a
function $g(\cdot, \ldots, \cdot)$ of the k-dimensional random variable, denoted by
$\mathcal{E}[g(X_1, \ldots, X_k)]$, is defined to be

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \sum g(x_1, \ldots, x_k) f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) \tag{18}$$

if the random variable $(X_1, \ldots, X_k)$ is discrete where the summation is
over all possible values of $(X_1, \ldots, X_k)$, and

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_k) f_{X_1, \ldots, X_k}(x_1, \ldots, x_k)\, dx_1 \cdots dx_k \tag{19}$$

if the random variable $(X_1, \ldots, X_k)$ is continuous. ////

In order for the above to be defined, it is understood that the sum and 
multiple integral, respectively, exist. 



Theorem 4 In particular, if $g(x_1, \ldots, x_k) = x_i$, then

\[
\mathcal{E}[g(X_1, \ldots, X_k)] = \mathcal{E}[X_i] = \mu_{X_i}. \tag{20}
\]

proof Assume that $(X_1, \ldots, X_k)$ is continuous. [The proof for
$(X_1, \ldots, X_k)$ discrete is similar.]

\[
\begin{aligned}
\mathcal{E}[g(X_1, \ldots, X_k)]
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_i\, f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_k \\
&= \int_{-\infty}^{\infty} x_i\, f_{X_i}(x_i)\,dx_i = \mathcal{E}[X_i],
\end{aligned}
\]

using the fact that the marginal density $f_{X_i}(x_i)$ is obtained from the joint
density by

\[
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_k. \qquad ////
\]

Similarly, the following theorem can be proved. 

Theorem 5 If $g(x_1, \ldots, x_k) = (x_i - \mathcal{E}[X_i])^2$, then

\[
\mathcal{E}[g(X_1, \ldots, X_k)] = \mathcal{E}[(X_i - \mathcal{E}[X_i])^2] = \operatorname{var}[X_i]. \qquad ////
\]

We might note that the "expectation" $\mathcal{E}[X_i]$ in the notation of Eq. (20)
has two different interpretations; one is that the expectation is taken over the
joint distribution of $X_1, \ldots, X_k$, and the other is that the expectation is taken
over the marginal distribution of $X_i$. What Theorem 4 really says is that
these two expectations are equivalent, and hence we are justified in using the
same notation for both.


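EXAMPLE 18 Consider the experiment of tossing two tetrahedra. Let X
be the number on the first and Y the larger of the two numbers. We gave
the joint discrete density function of X and Y in Example 2.

\[
\begin{aligned}
\mathcal{E}[XY] &= \sum xy\, f_{X,Y}(x, y) \\
&= 1 \cdot 1(\tfrac{1}{16}) + 1 \cdot 2(\tfrac{1}{16}) + 1 \cdot 3(\tfrac{1}{16}) + 1 \cdot 4(\tfrac{1}{16}) \\
&\quad + 2 \cdot 2(\tfrac{2}{16}) + 2 \cdot 3(\tfrac{1}{16}) + 2 \cdot 4(\tfrac{1}{16}) + 3 \cdot 3(\tfrac{3}{16}) \\
&\quad + 3 \cdot 4(\tfrac{1}{16}) + 4 \cdot 4(\tfrac{4}{16}) = \tfrac{135}{16}.
\end{aligned}
\]

\[
\begin{aligned}
\mathcal{E}[X + Y] &= (1 + 1)\tfrac{1}{16} + (1 + 2)\tfrac{1}{16} + (1 + 3)\tfrac{1}{16} + (1 + 4)\tfrac{1}{16} \\
&\quad + (2 + 2)\tfrac{2}{16} + (2 + 3)\tfrac{1}{16} + (2 + 4)\tfrac{1}{16} + (3 + 3)\tfrac{3}{16} \\
&\quad + (3 + 4)\tfrac{1}{16} + (4 + 4)\tfrac{4}{16} = \tfrac{90}{16}.
\end{aligned}
\]

$\mathcal{E}[X] = \tfrac{5}{2}$, and $\mathcal{E}[Y] = \tfrac{50}{16}$; hence $\mathcal{E}[X + Y] = \mathcal{E}[X] + \mathcal{E}[Y]$. ////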
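The sums in Example 18 are easy to reproduce by enumeration. The sketch below assumes two fair tetrahedra with faces 1 to 4 and builds the joint density of (X, Y) directly from the 16 equally likely outcomes; the helper names `tetrahedra_joint` and `expect` are illustrative only.

```python
from fractions import Fraction

def tetrahedra_joint():
    """Joint density of X (first toss) and Y (larger of the two tosses)."""
    density = {}
    for first in range(1, 5):
        for second in range(1, 5):
            key = (first, max(first, second))
            density[key] = density.get(key, Fraction(0)) + Fraction(1, 16)
    return density

def expect(g, density):
    """Equation (18): sum of g(x, y) f(x, y) over all points with positive mass."""
    return sum(g(x, y) * p for (x, y), p in density.items())

f = tetrahedra_joint()
e_xy = expect(lambda x, y: x * y, f)
e_sum = expect(lambda x, y: x + y, f)
e_x = expect(lambda x, y: x, f)
e_y = expect(lambda x, y: y, f)
print(e_xy, e_sum, e_x, e_y)
```

`Fraction` prints values in lowest terms, so 90/16 appears as 45/8 and 50/16 as 25/8.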


EXAMPLE 19 Suppose $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

\[
\mathcal{E}[XY] = \int_0^1 \int_0^1 xy(x + y)\,dx\,dy = \tfrac{1}{3}.
\]

EXAMPLE 20 Let the three-dimensional random variable $(X_1, X_2, X_3)$ have
the density

\[
f_{X_1,X_2,X_3}(x_1, x_2, x_3) = 8 x_1 x_2 x_3\, I_{(0,1)}(x_1) I_{(0,1)}(x_2) I_{(0,1)}(x_3).
\]

Suppose we want to find (i) $\mathcal{E}[3X_1 + 2X_2 + 6X_3]$, (ii) $\mathcal{E}[X_1 X_2 X_3]$, and
(iii) $\mathcal{E}[X_1 X_2]$. For (i) we have $g(x_1, x_2, x_3) = 3x_1 + 2x_2 + 6x_3$ and
obtain

\[
\begin{aligned}
\mathcal{E}[g(X_1, X_2, X_3)] &= \mathcal{E}[3X_1 + 2X_2 + 6X_3] \\
&= \int_0^1 \int_0^1 \int_0^1 (3x_1 + 2x_2 + 6x_3)\, 8 x_1 x_2 x_3\,dx_1\,dx_2\,dx_3 = \tfrac{22}{3}.
\end{aligned}
\]

For (ii), we get

\[
\mathcal{E}[X_1 X_2 X_3] = \int_0^1 \int_0^1 \int_0^1 8 x_1^2 x_2^2 x_3^2\,dx_1\,dx_2\,dx_3 = \tfrac{8}{27},
\]

and for (iii) we get $\mathcal{E}[X_1 X_2] = \tfrac{4}{9}$. ////
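Because the density in Example 20 is a product of one-dimensional factors, each joint raw moment is a product of elementary integrals. The following sketch (the name `moment` is illustrative) uses that shortcut, together with linearity of expectation for part (i), to recompute the three answers exactly.

```python
from fractions import Fraction

def moment(r1, r2, r3):
    """E[X1^r1 X2^r2 X3^r3] for the density 8*x1*x2*x3 on the unit cube.

    The integral factors into three pieces, each the integral of
    2 * x**(r + 1) over (0, 1), which equals 2 / (r + 2).
    """
    result = Fraction(1)
    for r in (r1, r2, r3):
        result *= Fraction(2, r + 2)
    return result

part_i = 3 * moment(1, 0, 0) + 2 * moment(0, 1, 0) + 6 * moment(0, 0, 1)
part_ii = moment(1, 1, 1)
part_iii = moment(1, 1, 0)
print(part_i, part_ii, part_iii)
```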


The following remark, the proof of which is left to the reader, displays a 
property of joint expectation. It is a generalization of (ii) in Theorem 3 of 
Chap. II. 


Remark

\[
\mathcal{E}\Bigl[\sum_{i=1}^{m} c_i\, g_i(X_1, \ldots, X_k)\Bigr] = \sum_{i=1}^{m} c_i\, \mathcal{E}[g_i(X_1, \ldots, X_k)]
\]

for constants $c_1, c_2, \ldots, c_m$. ////


4.2 Covariance and Correlation Coefficient 


Definition 19 Covariance Let X and Y be any two random variables
defined on the same probability space. The covariance of X and Y,
denoted by cov [X, Y] or $\sigma_{X,Y}$, is defined as

\[
\operatorname{cov}[X, Y] = \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] \tag{21}
\]

provided that the indicated expectation exists. ////


Definition 20 Correlation coefficient The correlation coefficient, denoted
by $\rho[X, Y]$ or $\rho_{X,Y}$, of random variables X and Y is defined to be

\[
\rho_{X,Y} = \frac{\operatorname{cov}[X, Y]}{\sigma_X \sigma_Y} \tag{22}
\]

provided that cov [X, Y], $\sigma_X$, and $\sigma_Y$ exist, and $\sigma_X > 0$ and $\sigma_Y > 0$. ////

Both the covariance and the correlation coefficient of random variables
X and Y are measures of a linear relationship of X and Y in the following sense:
cov [X, Y] will be positive when $X - \mu_X$ and $Y - \mu_Y$ tend to have the same sign
with high probability, and cov [X, Y] will be negative when $X - \mu_X$ and $Y - \mu_Y$
tend to have opposite signs with high probability. cov [X, Y] tends to measure
the linear relationship of X and Y; however, its actual magnitude does not have
much meaning since it depends on the variability of X and Y. The correlation
coefficient removes, in a sense, the individual variability of each X and Y by
dividing the covariance by the product of the standard deviations, and thus the
correlation coefficient is a better measure of the linear relationship of X and Y
than is the covariance. Also, the correlation coefficient is unitless and, as we
shall see in Subsec. 4.6 below, satisfies $-1 \le \rho_{X,Y} \le 1$.

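Remark $\operatorname{cov}[X, Y] = \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] = \mathcal{E}[XY] - \mu_X \mu_Y$.

proof

\[
\begin{aligned}
\mathcal{E}[(X - \mu_X)(Y - \mu_Y)] &= \mathcal{E}[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y] \\
&= \mathcal{E}[XY] - \mu_X \mathcal{E}[Y] - \mu_Y \mathcal{E}[X] + \mu_X \mu_Y \\
&= \mathcal{E}[XY] - \mu_X \mu_Y. \qquad ////
\end{aligned}
\]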


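EXAMPLE 21 Find $\rho_{X,Y}$ for X, the number on the first, and Y, the larger of
the two numbers, in the experiment of tossing two tetrahedra. We would
expect that $\rho_{X,Y}$ is positive since when X is large, Y tends to be large too.
We calculated $\mathcal{E}[XY]$, $\mathcal{E}[X]$, and $\mathcal{E}[Y]$ in Example 18 and obtained
$\mathcal{E}[XY] = \tfrac{135}{16}$, $\mathcal{E}[X] = \tfrac{5}{2}$, and $\mathcal{E}[Y] = \tfrac{50}{16}$. Thus
cov [X, Y] $= \tfrac{135}{16} - \tfrac{5}{2} \cdot \tfrac{50}{16} = \tfrac{10}{16}$. Now $\mathcal{E}[X^2] = \tfrac{15}{2}$ and
$\mathcal{E}[Y^2] = \tfrac{170}{16}$; hence var [X] $= \tfrac{5}{4}$ and var [Y] $= \tfrac{55}{64}$. So

\[
\rho_{X,Y} = \frac{\tfrac{10}{16}}{\sqrt{\tfrac{5}{4} \cdot \tfrac{55}{64}}} = \frac{2}{\sqrt{11}}. \qquad ////
\]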
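As a check on Example 21, the sketch below recomputes the covariance, the two variances, and the correlation coefficient from the enumerated joint density of two fair tetrahedra. The helper names are illustrative, and the shortcut formula cov [X, Y] = E[XY] - mu_X mu_Y from the preceding Remark is used throughout.

```python
from fractions import Fraction
from math import sqrt

def tetrahedra_joint():
    """Joint density of X (first toss) and Y (larger of the two tosses)."""
    density = {}
    for first in range(1, 5):
        for second in range(1, 5):
            key = (first, max(first, second))
            density[key] = density.get(key, Fraction(0)) + Fraction(1, 16)
    return density

def expect(g, density):
    return sum(g(x, y) * p for (x, y), p in density.items())

def second_order_summary(density):
    """Return (cov, var_x, var_y, rho) using E[XY] - mu_x * mu_y and its analogues."""
    mu_x = expect(lambda x, y: x, density)
    mu_y = expect(lambda x, y: y, density)
    cov = expect(lambda x, y: x * y, density) - mu_x * mu_y
    var_x = expect(lambda x, y: x * x, density) - mu_x ** 2
    var_y = expect(lambda x, y: y * y, density) - mu_y ** 2
    rho = float(cov) / sqrt(float(var_x * var_y))
    return cov, var_x, var_y, rho

cov, var_x, var_y, rho = second_order_summary(tetrahedra_joint())
print(cov, var_x, var_y, rho)
```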


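EXAMPLE 22 Find $\rho_{X,Y}$ for X and Y if $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.
We saw that $\mathcal{E}[XY] = \tfrac{1}{3}$ and $\mathcal{E}[X] = \mathcal{E}[Y] = \tfrac{7}{12}$ in Example 19. Now
$\mathcal{E}[X^2] = \mathcal{E}[Y^2] = \tfrac{5}{12}$; hence var [X] = var [Y] $= \tfrac{11}{144}$. Finally

\[
\rho_{X,Y} = \frac{\tfrac{1}{3} - \tfrac{49}{144}}{\tfrac{11}{144}} = -\frac{1}{11}.
\]

Does a negative correlation coefficient seem right?

////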
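The moments used in Example 22 all come from integrating monomials against x + y over the unit square, so they can be written in closed form. The sketch below is an illustration under that observation; the function name `moment` is not from the text.

```python
from fractions import Fraction

def moment(a, b):
    """E[X^a Y^b] for the density f(x, y) = x + y on the unit square.

    The integrand x^a y^b (x + y) splits into x^(a+1) y^b + x^a y^(b+1),
    and each monomial integrates to a product of reciprocals.
    """
    return Fraction(1, (a + 2) * (b + 1)) + Fraction(1, (a + 1) * (b + 2))

mean = moment(1, 0)                          # E[X]; E[Y] is the same by symmetry
variance = moment(2, 0) - mean ** 2          # var[X] = var[Y]
covariance = moment(1, 1) - mean * moment(0, 1)
rho = covariance / variance                  # sigma_X * sigma_Y equals the common variance
print(mean, variance, covariance, rho)
```

Since the two standard deviations coincide here, dividing by the common variance is the same as dividing by their product.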
4.3 Conditional Expectations

In the following chapters we shall have occasion to find the expected value
of random variables in conditional distributions, or the expected value of one
random variable given the value of another.

Definition 21 Conditional expectation Let (X, Y) be a two-dimensional
random variable and $g(\cdot, \cdot)$ a function of two variables. The conditional
expectation of g(X, Y) given X = x, denoted by $\mathcal{E}[g(X, Y) \mid X = x]$, is
defined to be

\[
\mathcal{E}[g(X, Y) \mid X = x] = \int_{-\infty}^{\infty} g(x, y)\, f_{Y|X}(y \mid x)\,dy \tag{23}
\]

if (X, Y) are jointly continuous, and

\[
\mathcal{E}[g(X, Y) \mid X = x] = \sum_j g(x, y_j)\, f_{Y|X}(y_j \mid x) \tag{24}
\]

if (X, Y) are jointly discrete, where the summation is over all possible
values of Y. ////

In particular, if g(x, y) = y, we have defined $\mathcal{E}[Y \mid X = x] = \mathcal{E}[Y \mid x]$.
$\mathcal{E}[Y \mid x]$ and $\mathcal{E}[g(X, Y) \mid x]$ are functions of x. Note that this definition can be
generalized to more than two dimensions. For example, let $(X_1, \ldots, X_k,
Y_1, \ldots, Y_m)$ be a (k + m)-dimensional continuous random variable with density
$f_{X_1,\ldots,X_k,Y_1,\ldots,Y_m}(x_1, \ldots, x_k, y_1, \ldots, y_m)$; then

\[
\begin{aligned}
&\mathcal{E}[g(X_1, \ldots, X_k, Y_1, \ldots, Y_m) \mid x_1, \ldots, x_k] \\
&\quad = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_k, y_1, \ldots, y_m)\,
f_{Y_1,\ldots,Y_m|X_1,\ldots,X_k}(y_1, \ldots, y_m \mid x_1, \ldots, x_k)\,dy_1 \cdots dy_m. \qquad ////
\end{aligned}
\]


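EXAMPLE 23 In the experiment of tossing two tetrahedra with X, the
number on the first, and Y, the larger of the two numbers, we found that

\[
f_{Y|X}(y \mid 2) =
\begin{cases}
\tfrac{1}{2} & \text{for } y = 2 \\
\tfrac{1}{4} & \text{for } y = 3 \\
\tfrac{1}{4} & \text{for } y = 4
\end{cases}
\]

in Example 9. Hence $\mathcal{E}[Y \mid X = 2] = \sum y\, f_{Y|X}(y \mid 2) = 2 \cdot \tfrac{1}{2} + 3 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{1}{4}
= \tfrac{11}{4}$. ////

EXAMPLE 24 For $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$, we found that

\[
f_{Y|X}(y \mid x) = \frac{x + y}{x + \tfrac{1}{2}}\, I_{(0,1)}(y)
\]

for 0 < x < 1 in Example 12. Hence

\[
\mathcal{E}[Y \mid X = x] = \int_0^1 y\, \frac{x + y}{x + \tfrac{1}{2}}\,dy = \frac{1}{x + \tfrac{1}{2}} \Bigl(\frac{x}{2} + \frac{1}{3}\Bigr)
\]

for 0 < x < 1. ////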


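As we stated above, $\mathcal{E}[g(Y) \mid x]$ is, in general, a function of x. Let us
denote it by h(x); that is, $h(x) = \mathcal{E}[g(Y) \mid x]$. Now we can evaluate the
expectation of h(X), a function of X, and will have $\mathcal{E}[h(X)] = \mathcal{E}[\mathcal{E}[g(Y) \mid X]]$.

This gives us

\[
\begin{aligned}
\mathcal{E}[\mathcal{E}[g(Y) \mid X]] = \mathcal{E}[h(X)]
&= \int_{-\infty}^{\infty} h(x)\, f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \mathcal{E}[g(Y) \mid x]\, f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \Bigl[\int_{-\infty}^{\infty} g(y)\, f_{Y|X}(y \mid x)\,dy\Bigr] f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(y)\, f_{Y|X}(y \mid x)\, f_X(x)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(y)\, f_{X,Y}(x, y)\,dy\,dx \\
&= \mathcal{E}[g(Y)].
\end{aligned}
\]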


Thus we have proved for jointly continuous random variables X and Y 
(the proof for X and Y jointly discrete is similar) the following simple yet very 
useful theorem. 


Theorem 6 Let (X, Y) be a two-dimensional random variable; then

\[
\mathcal{E}[g(Y)] = \mathcal{E}[\mathcal{E}[g(Y) \mid X]], \tag{25}
\]

and in particular

\[
\mathcal{E}[Y] = \mathcal{E}[\mathcal{E}[Y \mid X]]. \tag{26}
\]

////


Definition 22 Regression curve $\mathcal{E}[Y \mid X = x]$ is called the regression
curve of Y on x. It is also denoted by $\mu_{Y|X=x} = \mu_{Y|x}$. ////

Definition 23 Conditional variance The variance of Y given X = x is
defined by $\operatorname{var}[Y \mid X = x] = \mathcal{E}[Y^2 \mid X = x] - (\mathcal{E}[Y \mid X = x])^2$. ////


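Theorem 7 $\operatorname{var}[Y] = \mathcal{E}[\operatorname{var}[Y \mid X]] + \operatorname{var}[\mathcal{E}[Y \mid X]]$.

PROOF

\[
\begin{aligned}
\mathcal{E}[\operatorname{var}[Y \mid X]] &= \mathcal{E}[\mathcal{E}[Y^2 \mid X]] - \mathcal{E}[(\mathcal{E}[Y \mid X])^2] \\
&= \mathcal{E}[Y^2] - (\mathcal{E}[Y])^2 - \mathcal{E}[(\mathcal{E}[Y \mid X])^2] + (\mathcal{E}[Y])^2 \\
&= \operatorname{var}[Y] - \mathcal{E}[(\mathcal{E}[Y \mid X])^2] + (\mathcal{E}[\mathcal{E}[Y \mid X]])^2 \\
&= \operatorname{var}[Y] - \operatorname{var}[\mathcal{E}[Y \mid X]]. \qquad ////
\end{aligned}
\]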

Let us note in words what the two theorems say. Equation (26) states 
that the mean of Y is the mean or expectation of the conditional mean of Y, 
and Theorem 7 states that the variance of Y is the mean or expectation of the 
conditional variance of Y, plus the variance of the conditional mean of Y. 
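
Both statements can be checked on the two-tetrahedra experiment used in the earlier examples. The sketch below assumes fair tetrahedra, with X the first number and Y the larger of the two; `conditional_moments` is an illustrative helper that returns the marginal mass, conditional mean, and conditional variance for each value of X.

```python
from collections import defaultdict
from fractions import Fraction

def tetrahedra_joint():
    """Joint density of X (first toss) and Y (larger of the two tosses)."""
    density = {}
    for first in range(1, 5):
        for second in range(1, 5):
            key = (first, max(first, second))
            density[key] = density.get(key, Fraction(0)) + Fraction(1, 16)
    return density

def conditional_moments(density):
    """Map each x to (f_X(x), E[Y | X = x], var[Y | X = x])."""
    rows = defaultdict(dict)
    for (x, y), p in density.items():
        rows[x][y] = p
    out = {}
    for x, row in rows.items():
        fx = sum(row.values())
        m1 = sum(y * p for y, p in row.items()) / fx
        m2 = sum(y * y * p for y, p in row.items()) / fx
        out[x] = (fx, m1, m2 - m1 ** 2)
    return out

cond = conditional_moments(tetrahedra_joint())
mean_of_cond_mean = sum(fx * m for fx, m, _ in cond.values())        # E[E[Y | X]]
mean_of_cond_var = sum(fx * v for fx, _, v in cond.values())         # E[var[Y | X]]
var_of_cond_mean = sum(fx * m * m for fx, m, _ in cond.values()) - mean_of_cond_mean ** 2
print(mean_of_cond_mean, mean_of_cond_var + var_of_cond_mean)
```

The two printed values should agree with the mean and variance of Y obtained directly from its marginal distribution.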

We will conclude this subsection with one further theorem. The proof 
can be routinely obtained from Definition 21 and is left as an exercise. Also, 
the theorem can be generalized to more than two dimensions. 

Theorem 8 Let (X, Y) be a two-dimensional random variable and
$g_1(\cdot)$ and $g_2(\cdot)$ functions of one variable. Then

(i) $\mathcal{E}[g_1(Y) + g_2(Y) \mid X = x] = \mathcal{E}[g_1(Y) \mid X = x] + \mathcal{E}[g_2(Y) \mid X = x]$.

(ii) $\mathcal{E}[g_1(Y) g_2(X) \mid X = x] = g_2(x)\, \mathcal{E}[g_1(Y) \mid X = x]$. ////


4.4 Joint Moment Generating Function and Moments 

We will use our definition of the expectation of a function of several variables 
to define joint moments and the joint moment generating function. 

Definition 24 Joint moments The joint raw moments of $X_1, \ldots, X_k$
are defined by $\mathcal{E}[X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}]$, where the $r_j$'s are 0 or any positive
integer; the joint moments about the means are defined by

\[
\mathcal{E}[(X_1 - \mu_{X_1})^{r_1} \cdots (X_k - \mu_{X_k})^{r_k}]. \qquad ////
\]

Remark If $r_i = r_j = 1$ and all other $r_m$'s are 0, then that particular joint
moment about the means becomes $\mathcal{E}[(X_i - \mu_{X_i})(X_j - \mu_{X_j})]$, which is just
the covariance between $X_i$ and $X_j$. ////

Definition 25 Joint moment generating function The joint moment
generating function of $(X_1, \ldots, X_k)$ is defined by

\[
m_{X_1,\ldots,X_k}(t_1, \ldots, t_k) = \mathcal{E}\Bigl[\exp \sum_{j=1}^{k} t_j X_j\Bigr], \tag{27}
\]

if the expectation exists for all values of $t_1, \ldots, t_k$ such that $-h < t_j < h$
for some h > 0, j = 1, ..., k. ////

The rth moment of $X_j$ may be obtained from $m_{X_1,\ldots,X_k}(t_1, \ldots, t_k)$ by
differentiating it r times with respect to $t_j$ and then taking the limit as all the t's
approach 0. Also $\mathcal{E}[X_i^r X_j^s]$ can be obtained by differentiating the joint moment
generating function r times with respect to $t_i$ and s times with respect to $t_j$ and
then taking the limit as all the t's approach 0. Similarly other joint raw
moments can be generated.

Remark $m_X(t_1) = m_{X,Y}(t_1, 0) = \lim_{t_2 \to 0} m_{X,Y}(t_1, t_2)$, and
$m_Y(t_2) = m_{X,Y}(0, t_2) = \lim_{t_1 \to 0} m_{X,Y}(t_1, t_2)$; that is, the marginal moment
generating functions can be obtained from the joint moment generating function. ////

An example of a joint moment generating function will appear in Sec. 5 
of this chapter. 

4.5 Independence and Expectation 

We have already defined independence and expectation; in this section we will 
relate the two concepts. 

Theorem 9 If X and Y are independent and $g_1(\cdot)$ and $g_2(\cdot)$ are two
functions, each of a single argument, then

\[
\mathcal{E}[g_1(X) g_2(Y)] = \mathcal{E}[g_1(X)] \cdot \mathcal{E}[g_2(Y)].
\]

proof We will give the proof for jointly continuous random
variables.

\[
\begin{aligned}
\mathcal{E}[g_1(X) g_2(Y)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_1(x) g_2(y)\, f_{X,Y}(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_1(x) g_2(y)\, f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} g_1(x) f_X(x)\,dx \cdot \int_{-\infty}^{\infty} g_2(y) f_Y(y)\,dy \\
&= \mathcal{E}[g_1(X)] \cdot \mathcal{E}[g_2(Y)]. \qquad ////
\end{aligned}
\]

Corollary If X and Y are independent, then cov [X, Y] = 0.

proof Take $g_1(x) = x - \mu_X$ and $g_2(y) = y - \mu_Y$; by Theorem 9,

\[
\begin{aligned}
\operatorname{cov}[X, Y] &= \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] = \mathcal{E}[g_1(X) g_2(Y)] \\
&= \mathcal{E}[g_1(X)]\, \mathcal{E}[g_2(Y)] \\
&= \mathcal{E}[X - \mu_X] \cdot \mathcal{E}[Y - \mu_Y] = 0 \quad \text{since } \mathcal{E}[X - \mu_X] = 0. \qquad ////
\end{aligned}
\]


Definition 26 Uncorrelated random variables Random variables X and
Y are defined to be uncorrelated if and only if cov [X, Y] = 0. ////

Remark The converse of the above corollary is not always true; that is,
cov [X, Y] = 0 does not always imply that X and Y are independent, as
the following example shows. ////


EXAMPLE 25 Let U be a random variable which is uniformly distributed
over the interval (0, 1). Define $X = \sin 2\pi U$ and $Y = \cos 2\pi U$. X and
Y are clearly not independent since if a value of X is known, then U is
one of two values, and so Y is also one of two values; hence the conditional
distribution of Y is not the same as the marginal distribution.
$\mathcal{E}[Y] = \int_0^1 \cos 2\pi u\,du = 0$, and $\mathcal{E}[X] = \int_0^1 \sin 2\pi u\,du = 0$; so
$\operatorname{cov}[X, Y] = \mathcal{E}[XY] = \int_0^1 \sin 2\pi u \cos 2\pi u\,du = \tfrac{1}{2} \int_0^1 \sin 4\pi u\,du = 0$. ////
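
A numerical sketch of Example 25 follows. It approximates expectations over U by a midpoint rule on (0, 1), which is an assumption of this illustration rather than anything in the text. Besides confirming that the covariance vanishes, it compares E[X^2 Y^2] with E[X^2] E[Y^2]; by Theorem 9 these would be equal if X and Y were independent.

```python
import math

def mean_over_u(g, n=10_000):
    """Midpoint-rule approximation of E[g(U)] for U uniform on (0, 1)."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n

def x(u):
    return math.sin(2 * math.pi * u)

def y(u):
    return math.cos(2 * math.pi * u)

cov_xy = mean_over_u(lambda u: x(u) * y(u)) - mean_over_u(x) * mean_over_u(y)

# Independence would force E[X^2 Y^2] = E[X^2] E[Y^2] (Theorem 9 with g1 = g2 = square).
e_x2y2 = mean_over_u(lambda u: (x(u) * y(u)) ** 2)
e_x2_times_e_y2 = mean_over_u(lambda u: x(u) ** 2) * mean_over_u(lambda u: y(u) ** 2)

print(cov_xy, e_x2y2, e_x2_times_e_y2)
```

The gap between the last two numbers (about 1/8 versus 1/4) is one concrete way to see the dependence that the covariance misses.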


Theorem 10 Two jointly distributed random variables X and Y are
independent if and only if $m_{X,Y}(t_1, t_2) = m_X(t_1) m_Y(t_2)$ for all $t_1, t_2$ for
which $-h < t_i < h$, i = 1, 2, for some h > 0.

proof [Recall that $m_X(t_1)$ is the moment generating function of X.
Also note that $m_X(t_1) = m_{X,Y}(t_1, 0)$.] X and Y independent imply that
the joint moment generating function factors into the product of the
marginal moment generating functions by Theorem 9 by taking $g_1(x) = e^{t_1 x}$
and $g_2(y) = e^{t_2 y}$. The proof in the other direction will be omitted.

////

Remark Both Theorems 9 and 10 can be generalized from two random
variables to k random variables. ////

4.6 Cauchy-Schwarz Inequality

Theorem 11 Cauchy-Schwarz inequality Let X and Y have finite
second moments; then $(\mathcal{E}[XY])^2 = |\mathcal{E}[XY]|^2 \le \mathcal{E}[X^2]\, \mathcal{E}[Y^2]$, with equality
if and only if P[Y = cX] = 1 for some constant c.

proof The existence of expectations $\mathcal{E}[X]$, $\mathcal{E}[Y]$, and $\mathcal{E}[XY]$
follows from the existence of expectations $\mathcal{E}[X^2]$ and $\mathcal{E}[Y^2]$. Define
$0 \le h(t) = \mathcal{E}[(tX - Y)^2] = \mathcal{E}[X^2] t^2 - 2\mathcal{E}[XY] t + \mathcal{E}[Y^2]$. Now h(t) is a
quadratic function in t which is greater than or equal to 0. If h(t) > 0,
then the roots of h(t) are not real; so $4(\mathcal{E}[XY])^2 - 4\mathcal{E}[X^2]\, \mathcal{E}[Y^2] < 0$,
or $(\mathcal{E}[XY])^2 < \mathcal{E}[X^2]\, \mathcal{E}[Y^2]$. If h(t) = 0 for some t, say $t_0$, then
$\mathcal{E}[(t_0 X - Y)^2] = 0$, which implies $P[t_0 X = Y] = 1$. ////

Corollary $|\rho_{X,Y}| \le 1$, with equality if and only if one random variable
is a linear function of the other with probability 1.

proof Rewrite the Cauchy-Schwarz inequality as
$|\mathcal{E}[UV]| \le \sqrt{\mathcal{E}[U^2]\, \mathcal{E}[V^2]}$, and set $U = X - \mu_X$ and $V = Y - \mu_Y$. ////


5 BIVARIATE NORMAL DISTRIBUTION 

One of the important multivariate densities is the multivariate normal 
density, which is a generalization of the normal distribution for a unidimensional 
random variable. In this section we shall discuss a special case, the case of the 
bivariate normal. In our discussion we will include the joint density, marginal 
densities, conditional densities, conditional means and variances, covariance, 
and the moment generating function. This section, then, will give an example 
of many of the concepts defined in the preceding sections of this chapter. 


5.1 Density Function 


Definition 27 Bivariate normal distribution Let the two-dimensional
random variable (X, Y) have the joint probability density function

\[
f_{X,Y}(x, y) = f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}}
\exp\Bigl\{ -\frac{1}{2(1 - \rho^2)} \Bigl[ \Bigl(\frac{x - \mu_X}{\sigma_X}\Bigr)^2
- 2\rho\, \frac{x - \mu_X}{\sigma_X}\, \frac{y - \mu_Y}{\sigma_Y}
+ \Bigl(\frac{y - \mu_Y}{\sigma_Y}\Bigr)^2 \Bigr] \Bigr\} \tag{28}
\]

for $-\infty < x < \infty$, $-\infty < y < \infty$, where $\sigma_Y$, $\sigma_X$, $\mu_X$, $\mu_Y$, and $\rho$ are
constants such that $-1 < \rho < 1$, $0 < \sigma_Y$, $0 < \sigma_X$, $-\infty < \mu_X < \infty$, and
$-\infty < \mu_Y < \infty$. Then the random variable (X, Y) is defined to have a
bivariate normal distribution. ////


The density in Eq. (28) may be represented by a bell-shaped surface
z = f(x, y) as in Fig. 7. Any plane parallel to the xy plane which cuts the
surface will intersect it in an elliptic curve, while any plane perpendicular to the
xy plane will cut the surface in a curve of the normal form. The probability
that a point (X, Y) will lie in any region R of the xy plane is obtained by
integrating the density over that region:

\[
P[(X, Y) \text{ is in } R] = \iint_R f(x, y)\,dy\,dx. \tag{29}
\]

The density might, for example, represent the distribution of hits on a vertical 
target, where x and y represent the horizontal and vertical deviations from the 
central lines. And in fact the distribution closely approximates the distribution 
of this as well as many other bivariate populations encountered in practice. 

We must first show that the function actually represents a density by
showing that its integral over the whole plane is 1; that is,

\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1. \tag{30}
\]

The density is, of course, positive. To simplify the integral, we shall substitute

\[
u = \frac{x - \mu_X}{\sigma_X} \quad \text{and} \quad v = \frac{y - \mu_Y}{\sigma_Y} \tag{31}
\]

so that it becomes

\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi \sqrt{1 - \rho^2}}\,
e^{-[1/2(1 - \rho^2)](u^2 - 2\rho uv + v^2)}\,dv\,du.
\]

On completing the square on u in the exponent, we have

\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi \sqrt{1 - \rho^2}}\,
e^{-[1/2(1 - \rho^2)][(u - \rho v)^2 + (1 - \rho^2) v^2]}\,dv\,du,
\]

and if we substitute

\[
w = \frac{u - \rho v}{\sqrt{1 - \rho^2}} \quad \text{and} \quad dw = \frac{du}{\sqrt{1 - \rho^2}},
\]

the integral may be written as the product of two simple integrals

\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\,dw \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}\,dv, \tag{32}
\]

both of which are 1, as we have seen in studying the univariate normal
distribution. Equation (30) is thus verified.
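
Equation (30) can also be checked numerically for a particular choice of parameters. The sketch below is only an illustration: the parameter values are arbitrary, the integration window of eight standard deviations on each side and the midpoint rule are assumptions made here for simplicity, and the function names are not from the text.

```python
import math

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density of Eq. (28), written in terms of the standardized u and v of Eq. (31)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm

def integrate_density(mu_x, mu_y, sigma_x, sigma_y, rho, n=300, spread=8.0):
    """Midpoint-rule integral of the density over mu +/- spread * sigma in each axis."""
    hx = 2.0 * spread * sigma_x / n
    hy = 2.0 * spread * sigma_y / n
    total = 0.0
    for i in range(n):
        x = mu_x - spread * sigma_x + (i + 0.5) * hx
        for j in range(n):
            y = mu_y - spread * sigma_y + (j + 0.5) * hy
            total += bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho)
    return total * hx * hy

total = integrate_density(mu_x=1.0, mu_y=-2.0, sigma_x=2.0, sigma_y=0.5, rho=0.6)
print(total)
```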


Remark The cumulative bivariate normal distribution

\[
F(x, y) = \int_{-\infty}^{y} \Bigl[ \int_{-\infty}^{x} f(x', y')\,dx' \Bigr] dy'
\]

may be reduced to a form involving only the parameter $\rho$ by making the
substitution in Eq. (31). ////


5.2 Moment Generating Function and Moments 

To obtain the moments of X and Y, we shall find their joint moment generating
function, which is given by

\[
m_{X,Y}(t_1, t_2) = m(t_1, t_2) = \mathcal{E}[e^{t_1 X + t_2 Y}]
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x + t_2 y} f(x, y)\,dy\,dx.
\]

Theorem 12 The moment generating function of the bivariate normal
distribution is

\[
m(t_1, t_2) = \exp[t_1 \mu_X + t_2 \mu_Y + \tfrac{1}{2}(t_1^2 \sigma_X^2 + 2\rho t_1 t_2 \sigma_X \sigma_Y + t_2^2 \sigma_Y^2)]. \tag{33}
\]
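proof Let us again substitute for x and y in terms of u and v to
obtain

\[
m(t_1, t_2) = e^{t_1 \mu_X + t_2 \mu_Y} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
e^{t_1 \sigma_X u + t_2 \sigma_Y v}\, \frac{1}{2\pi \sqrt{1 - \rho^2}}\,
e^{-[1/2(1 - \rho^2)](u^2 - 2\rho uv + v^2)}\,dv\,du. \tag{34}
\]

The combined exponents in the integrand may be written

\[
-\frac{1}{2(1 - \rho^2)} [u^2 - 2\rho uv + v^2 - 2(1 - \rho^2) t_1 \sigma_X u - 2(1 - \rho^2) t_2 \sigma_Y v],
\]

and on completing the square first on u and then on v, we find this
expression becomes

\[
-\frac{1}{2(1 - \rho^2)} \{ [u - \rho v - (1 - \rho^2) t_1 \sigma_X]^2 + (1 - \rho^2)(v - \rho t_1 \sigma_X - t_2 \sigma_Y)^2
- (1 - \rho^2)(t_1^2 \sigma_X^2 + 2\rho t_1 t_2 \sigma_X \sigma_Y + t_2^2 \sigma_Y^2) \},
\]

which, if we substitute

\[
w = \frac{u - \rho v - (1 - \rho^2) t_1 \sigma_X}{\sqrt{1 - \rho^2}} \quad \text{and} \quad z = v - \rho t_1 \sigma_X - t_2 \sigma_Y,
\]

becomes

\[
-\tfrac{1}{2} w^2 - \tfrac{1}{2} z^2 + \tfrac{1}{2}(t_1^2 \sigma_X^2 + 2\rho t_1 t_2 \sigma_X \sigma_Y + t_2^2 \sigma_Y^2),
\]

and the integral in Eq. (34) may be written

\[
\begin{aligned}
m(t_1, t_2) &= e^{t_1 \mu_X + t_2 \mu_Y} \exp[\tfrac{1}{2}(t_1^2 \sigma_X^2 + 2\rho t_1 t_2 \sigma_X \sigma_Y + t_2^2 \sigma_Y^2)]
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-w^2/2 - z^2/2}\,dw\,dz \\
&= \exp[t_1 \mu_X + t_2 \mu_Y + \tfrac{1}{2}(t_1^2 \sigma_X^2 + 2\rho t_1 t_2 \sigma_X \sigma_Y + t_2^2 \sigma_Y^2)]
\end{aligned}
\]

since the double integral is equal to unity.

////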


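Theorem 13 If (X, Y) has bivariate normal distribution, then

\[
\begin{aligned}
\mathcal{E}[X] &= \mu_X, \\
\mathcal{E}[Y] &= \mu_Y, \\
\operatorname{var}[X] &= \sigma_X^2, \\
\operatorname{var}[Y] &= \sigma_Y^2, \\
\operatorname{cov}[X, Y] &= \rho \sigma_X \sigma_Y,
\end{aligned}
\]

and

\[
\rho_{X,Y} = \rho.
\]

proof The moments may be obtained by evaluating the appropriate
derivative of $m(t_1, t_2)$ at $t_1 = 0$, $t_2 = 0$. Thus,

\[
\mathcal{E}[X] = \frac{\partial m}{\partial t_1} \Big|_{t_1, t_2 = 0} = \mu_X,
\qquad
\mathcal{E}[X^2] = \frac{\partial^2 m}{\partial t_1^2} \Big|_{t_1, t_2 = 0} = \mu_X^2 + \sigma_X^2.
\]

Hence the variance of X is

\[
\mathcal{E}[(X - \mu_X)^2] = \mathcal{E}[X^2] - \mu_X^2 = \sigma_X^2.
\]

Similarly, on differentiating with respect to $t_2$, one finds the mean and
variance of Y to be $\mu_Y$ and $\sigma_Y^2$. We can also obtain joint moments

\[
\mathcal{E}[X^r Y^s]
\]

by differentiating $m(t_1, t_2)$ r times with respect to $t_1$ and s times with respect
to $t_2$ and then putting $t_1$ and $t_2$ equal to 0. The covariance of X and Y is

\[
\begin{aligned}
\mathcal{E}[(X - \mu_X)(Y - \mu_Y)] &= \mathcal{E}[XY - X\mu_Y - Y\mu_X + \mu_X \mu_Y] \\
&= \mathcal{E}[XY] - \mu_X \mu_Y \\
&= \frac{\partial^2}{\partial t_1\, \partial t_2}\, m(t_1, t_2) \Big|_{t_1 = t_2 = 0} - \mu_X \mu_Y \\
&= \rho \sigma_X \sigma_Y.
\end{aligned}
\]

Hence, the parameter $\rho$ is the correlation coefficient of X and Y. ////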
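
The derivative calculations in the proof can be imitated numerically. The sketch below assumes the closed form of Eq. (33) and replaces the partial derivatives at the origin by central finite differences; the step size `h`, the parameter values, and the function names are choices made for this illustration only.

```python
import math

def mgf(t1, t2, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Joint moment generating function of Eq. (33)."""
    quad = (t1 * sigma_x) ** 2 + 2.0 * rho * t1 * t2 * sigma_x * sigma_y + (t2 * sigma_y) ** 2
    return math.exp(t1 * mu_x + t2 * mu_y + 0.5 * quad)

def moments_from_mgf(mu_x, mu_y, sigma_x, sigma_y, rho, h=1e-3):
    """Approximate E[X], E[Y], var[X], cov[X, Y] from finite differences of the mgf at 0."""
    def m(a, b):
        return mgf(a, b, mu_x, mu_y, sigma_x, sigma_y, rho)
    e_x = (m(h, 0.0) - m(-h, 0.0)) / (2.0 * h)                         # first partial in t1
    e_y = (m(0.0, h) - m(0.0, -h)) / (2.0 * h)                         # first partial in t2
    e_x2 = (m(h, 0.0) - 2.0 * m(0.0, 0.0) + m(-h, 0.0)) / h ** 2        # second partial in t1
    e_xy = (m(h, h) - m(h, -h) - m(-h, h) + m(-h, -h)) / (4.0 * h * h)  # mixed partial
    return e_x, e_y, e_x2 - e_x ** 2, e_xy - e_x * e_y

e_x, e_y, var_x, cov_xy = moments_from_mgf(mu_x=1.0, mu_y=-2.0, sigma_x=2.0, sigma_y=0.5, rho=0.6)
print(e_x, e_y, var_x, cov_xy)
```

With these inputs the theorem predicts a mean of 1 for X, a mean of -2 for Y, a variance of 4 for X, and a covariance of 0.6, up to the finite-difference error.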


Theorem 14 If (X, Y) has a bivariate normal distribution, then X and Y
are independent if and only if X and Y are uncorrelated.

proof X and Y are uncorrelated if and only if cov [X, Y] = 0 or,
equivalently, if and only if $\rho_{X,Y} = \rho = 0$. It can be observed that if
$\rho = 0$, the joint density f(x, y) becomes the product of two univariate
normal distributions; so that $\rho = 0$ implies X and Y are independent.
We know that, in general, independence of X and Y implies that X and Y
are uncorrelated. ////
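5.3 Marginal and Conditional Densities


Theorem 15 If (X, Y) has a bivariate normal distribution, then the
marginal distributions of X and Y are univariate normal distributions; that is,
X is normally distributed with mean $\mu_X$ and variance $\sigma_X^2$, and Y is
normally distributed with mean $\mu_Y$ and variance $\sigma_Y^2$.

proof The marginal density of one of the variables, X for example,
is by definition

\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy;
\]

and again substituting

\[
v = \frac{y - \mu_Y}{\sigma_Y}
\]

and completing the square on v, one finds that

\[
f_X(x) = \int_{-\infty}^{\infty} \frac{1}{2\pi \sigma_X \sqrt{1 - \rho^2}}
\exp\Bigl[ -\frac{1}{2} \Bigl(\frac{x - \mu_X}{\sigma_X}\Bigr)^2
- \frac{1}{2(1 - \rho^2)} \Bigl(v - \rho\, \frac{x - \mu_X}{\sigma_X}\Bigr)^2 \Bigr] dv.
\]

Then the substitutions

\[
w = \frac{v - \rho(x - \mu_X)/\sigma_X}{\sqrt{1 - \rho^2}} \quad \text{and} \quad dw = \frac{dv}{\sqrt{1 - \rho^2}}
\]

show at once that

\[
f_X(x) = \frac{1}{\sqrt{2\pi}\, \sigma_X} \exp\Bigl[ -\frac{1}{2} \Bigl(\frac{x - \mu_X}{\sigma_X}\Bigr)^2 \Bigr],
\]

the univariate normal density. Similarly the marginal density of Y may
be found to be

\[
f_Y(y) = \frac{1}{\sqrt{2\pi}\, \sigma_Y} \exp\Bigl[ -\frac{1}{2} \Bigl(\frac{y - \mu_Y}{\sigma_Y}\Bigr)^2 \Bigr]. \qquad ////
\]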
Theorem 16 If (X, Y) has a bivariate normal distribution, then the
conditional distribution of X given Y = y is normal with mean
$\mu_X + (\rho \sigma_X / \sigma_Y)(y - \mu_Y)$ and variance $\sigma_X^2 (1 - \rho^2)$. Also, the conditional
distribution of Y given X = x is normal with mean $\mu_Y + (\rho \sigma_Y / \sigma_X)(x - \mu_X)$
and variance $\sigma_Y^2 (1 - \rho^2)$.

proof The conditional distributions are obtained from the joint
and marginal distributions. Thus, the conditional density of X for fixed
values of Y is

\[
f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)},
\]

and, after substituting, the expression may be put in the form

\[
f_{X|Y}(x \mid y) = \frac{1}{\sqrt{2\pi}\, \sigma_X \sqrt{1 - \rho^2}}
\exp\Bigl\{ -\frac{1}{2\sigma_X^2 (1 - \rho^2)} \Bigl[ x - \mu_X - \frac{\rho \sigma_X}{\sigma_Y}(y - \mu_Y) \Bigr]^2 \Bigr\}, \tag{35}
\]

which is a univariate normal density with mean $\mu_X + (\rho \sigma_X / \sigma_Y)(y - \mu_Y)$
and with variance $\sigma_X^2 (1 - \rho^2)$. The conditional distribution of Y may be
obtained by interchanging x and y throughout Eq. (35) to get

\[
f_{Y|X}(y \mid x) = \frac{1}{\sqrt{2\pi}\, \sigma_Y \sqrt{1 - \rho^2}}
\exp\Bigl\{ -\frac{1}{2\sigma_Y^2 (1 - \rho^2)} \Bigl[ y - \mu_Y - \frac{\rho \sigma_Y}{\sigma_X}(x - \mu_X) \Bigr]^2 \Bigr\}. \tag{36}
\]

////

As we already noted, the mean value of a random variable in a conditional
distribution is called a regression curve when regarded as a function of the fixed
variable in the conditional distribution. Thus the regression for X on Y = y in
Eq. (35) is $\mu_X + (\rho \sigma_X / \sigma_Y)(y - \mu_Y)$, which is a linear function of y in the present
case. For bivariate distributions in general, the mean of X in the conditional
density of X given Y = y will be some function of y, say $g(\cdot)$, and the equation

\[
x = g(y)
\]

when plotted in the xy plane gives the regression curve for X. It is simply a
curve which gives the location of the mean of X for various values of Y in the
conditional density of X given Y = y.

For the bivariate normal distribution, the regression curve is the straight
line obtained by plotting

\[
x = \mu_X + \frac{\rho \sigma_X}{\sigma_Y}(y - \mu_Y),
\]

as shown in Fig. 8. The conditional density of X given Y = y, $f_{X|Y}(x \mid y)$, is
also plotted in Fig. 8 for two particular values $y_0$ and $y_1$ of Y.
PROBLEMS

1 Prove or disprove:

(a) If $P[X > Y] = 1$, then $\mathcal{E}[X] > \mathcal{E}[Y]$.

(b) If $\mathcal{E}[X] > \mathcal{E}[Y]$, then $P[X > Y] = 1$.

(c) If $\mathcal{E}[X] > \mathcal{E}[Y]$, then $P[X > Y] > 0$.

2 Prove or disprove:

(a) If $F_X(z) > F_Y(z)$ for all z, then $\mathcal{E}[Y] > \mathcal{E}[X]$.

(b) If $\mathcal{E}[Y] > \mathcal{E}[X]$, then $F_X(z) > F_Y(z)$ for all z.

(c) If $\mathcal{E}[Y] > \mathcal{E}[X]$, then $F_X(z) > F_Y(z)$ for some z.

(d) If $F_X(z) = F_Y(z)$ for all z, then $P[X = Y] = 1$.

(e) If $F_X(z) > F_Y(z)$ for all z, then $P[X < Y] > 0$.

(f) If Y = X + 1, then $F_X(z) = F_Y(z + 1)$ for all z.

3 If $X_1$ and $X_2$ are independent random variables with distribution given by
$P[X_i = -1] = P[X_i = 1] = \tfrac{1}{2}$ for i = 1, 2, then are $X_1$ and $X_1 X_2$ independent?

4 A penny and dime are tossed. Let X denote the number of heads up. Then
the penny is tossed again. Let Y denote the number of heads up on the dime
(from the first toss) and the penny from the second toss.

(a) Find the conditional distribution of Y given X = 1.

(b) Find the covariance of X and Y.

5 If X and Y have joint distribution given by

$f_{X,Y}(x, y) = 2 I_{(0,y)}(x) I_{(0,1)}(y)$,

(a) Find cov [X, Y].

(b) Find the conditional distribution of Y given X = x.

6 Consider a sample of size 2 drawn without replacement from an urn containing
three balls, numbered 1, 2, and 3. Let X be the number on the first ball drawn
and Y the larger of the two numbers drawn.

(a) Find the joint discrete density function of X and Y.

(b) Find $P[X = 1 \mid Y = 3]$.

(c) Find cov [X, Y].
7 Consider two random variables X and Y having a joint probability density
function

$f_{X,Y}(x, y) = \tfrac{1}{2} xy\, I_{(0,x)}(y) I_{(0,2)}(x)$.

(a) Find the marginal distributions of X and Y.

(b) Are X and Y independent?

8 If X has a Bernoulli distribution with parameter p (that is, $P[X = 1] = p = 1 - P[X = 0]$),
$\mathcal{E}[Y \mid X = 0] = 1$, and $\mathcal{E}[Y \mid X = 1] = 2$, what is $\mathcal{E}[Y]$?

9 Consider a sample of size 2 drawn without replacement from an urn containing
three balls, numbered 1, 2, and 3. Let X be the smaller of the two numbers
drawn and Y the larger.

(a) Find the joint discrete density function of X and Y.

(b) Find the conditional distribution of Y given X = 1.

(c) Find cov [X, Y].

10 Let X and Y be independent random variables, each having the same geometric
distribution. Find P[X = Y].

11 If $F(\cdot)$ is a cumulative distribution function:

(a) Is F(x, y) = F(x) + F(y) a joint cumulative distribution function?

(b) Is F(x, y) = F(x)F(y) a joint cumulative distribution function?

(c) Is F(x, y) = max [F(x), F(y)] a joint cumulative distribution function?

(d) Is F(x, y) = min [F(x), F(y)] a joint cumulative distribution function?

12 Prove

$F_X(x) + F_Y(y) - 1 \le F_{X,Y}(x, y) \le \sqrt{F_X(x) F_Y(y)}$ for all x, y.

13 Three fair coins are tossed. Let X denote the number of heads on the first two
coins, and let Y denote the number of tails on the last two coins.

(a) Find the joint distribution of X and Y.

(b) Find the conditional distribution of Y given that X = 1.

(c) Find cov [X, Y].

14 Let random variable X have a density function $f(\cdot)$, cumulative distribution
function $F(\cdot)$, mean $\mu$, and variance $\sigma^2$. Define $Y = \alpha + \beta X$, where $\alpha$ and $\beta$ are
constants satisfying $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Select $\alpha$ and $\beta$ so that Y has mean 0 and variance 1.

(b) What is the correlation coefficient between X and Y?

(c) Find the cumulative distribution function of Y in terms of $\alpha$, $\beta$, and $F(\cdot)$.

(d) If X is symmetrically distributed about $\mu$, is Y necessarily symmetrically
distributed about its mean? (Hint: Z is symmetrically distributed about
constant C if Z - C and -(Z - C) have the same distribution.)

15 Suppose that random variable X is uniformly distributed over the interval (0, 1);
that is, $f_X(x) = I_{(0,1)}(x)$. Assume that the conditional distribution of Y given
X = x has a binomial distribution with parameters n and p = x; i.e.,

$P[Y = y \mid X = x] = \binom{n}{y} x^y (1 - x)^{n - y}$ for y = 0, 1, ..., n.

(a) Find $\mathcal{E}[Y]$.

(b) Find the distribution of Y.

16 Suppose that the joint probability density function of (X, Y) is given by

$f_{X,Y}(x, y) = [1 - \alpha(1 - 2x)(1 - 2y)] I_{(0,1)}(x) I_{(0,1)}(y)$,

where the parameter $\alpha$ satisfies $-1 \le \alpha \le 1$.

(a) Prove or disprove: X and Y are independent if and only if X and Y are
uncorrelated.

An isosceles triangle is formed as indicated in the sketch.

(b) If (X, Y) has the joint density given above, pick $\alpha$ to maximize the expected
area of the triangle.

(c) What is the probability that the triangle falls within the unit square with
corners at (0, 0), (1, 0), (1, 1), and (0, 1)?

*(d) Find the expected length of the perimeter of the triangle.

17 Consider tossing two tetrahedra with sides numbered 1 to 4. Let $Y_1$ denote the
smaller of the two downturned numbers and $Y_2$ the larger.

(a) Find the joint density function of $Y_1$ and $Y_2$.

(b) Find $P[Y_1 > 2, Y_2 > 2]$.

(c) Find the mean and variance of $Y_1$ and $Y_2$.

(d) Find the conditional distribution of $Y_2$ given $Y_1$ for each of the possible
values of $Y_1$.

(e) Find the correlation coefficient of $Y_1$ and $Y_2$.

18 Let $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$.

(a) Find $P[X > 1]$.

(b) Find $P[1 < X + Y < 2]$.

(c) Find $P[X < Y \mid X < 2Y]$.

(d) Find m such that $P[X + Y < m] = \tfrac{1}{2}$.

(e) Find $P[0 < X < 1 \mid Y = 2]$.

(f) Find the correlation coefficient of X and Y.

*19 Let $f_{X,Y}(x, y) = e^{-y}(1 - e^{-x}) I_{(0,y]}(x) I_{[0,\infty)}(y) + e^{-x}(1 - e^{-y}) I_{(0,x]}(y) I_{[0,\infty)}(x)$.

(a) Show that $f_{X,Y}(\cdot, \cdot)$ is a probability density function.

(b) Find the marginal distributions of X and Y.

(c) Find $\mathcal{E}[Y \mid X = x]$ for 0 < x.

(d) Find $P[X < 2, Y < 2]$.

(e) Find the correlation coefficient of X and Y.

(f) Find another joint probability density function having the same marginals.
*20 Suppose X and Y are independent and identically distributed random variables
with probability density function $f(\cdot)$ that is symmetrical about 0.

(a) Prove that $P[|X + Y| < 2|X|] >$

(b) Select some symmetrical probability density function $f(\cdot)$, and evaluate
$P[|X + Y| < 2|X|]$.

*21 Prove or disprove: If $\mathcal{E}[Y \mid X] = X$, $\mathcal{E}[X \mid Y] = Y$, and both $\mathcal{E}[X^2]$ and $\mathcal{E}[Y^2]$ are
finite, then $P[X = Y] = 1$. (Possible hint: $P[X = Y] = 1$ if var [X - Y] = 0.)

22 A multivariate Chebyshev inequality: Let $(X_1, \ldots, X_m)$ be jointly distributed with
$\mathcal{E}[X_j] = \mu_j$ and var $[X_j] = \sigma_j^2$ for j = 1, ..., m. Define $A_j = \{|X_j - \mu_j| \le \sqrt{m}\, t \sigma_j\}$.
Show that $P[\bigcap_{j=1}^{m} A_j] \ge 1 - t^{-2}$, for t > 0.

23 Let $f_X(\cdot)$ be a probability density function with corresponding cumulative
distribution function $F_X(\cdot)$. In terms of $f_X(\cdot)$ and/or $F_X(\cdot)$:

(a) Find $P[X > x_0 + \Delta x \mid X > x_0]$.

(b) Find $P[x_0 < X < x_0 + \Delta x \mid X > x_0]$.

(c) Find the limit of the above divided by $\Delta x$ as $\Delta x$ goes to 0.

(d) Evaluate the quantities in parts (a) to (c) for $f_X(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$.

24 Let N equal the number of times a certain device may be used before it breaks.
The probability is p that it will break on any one try given that it did not break
on any of the previous tries.

(a) Express this in terms of conditional probabilities.

(b) Express it in terms of a density function, and find the density function.

25 Player A tosses a coin with sides numbered 1 and 2. B spins a spinner evenly
graduated from 0 to 3. B's spinner is fair, but A's coin is not; it comes up 1
with a probability p, not necessarily equal to $\tfrac{1}{2}$. The payoff X of this game is the
difference in their numbers (A's number minus B's). Find the cumulative
distribution function of X.

26 An urn contains four balls; two of the balls are numbered with a 1, and the other
two are numbered with a 2. Two balls are drawn from the urn without
replacement. Let X denote the smaller of the numbers on the drawn balls and Y the
larger.

(a) Find the joint density of X and Y.

(b) Find the marginal distribution of Y.

(c) Find the cov [X, Y].

27 The joint probability density function of X and Y is given by

$f_{X,Y}(x, y) = 3(x + y) I_{(0,1)}(x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

(Note the symmetry in x and y.)

(a) Find the marginal density of X.

(b) Find $P[X + Y < .5]$.

(c) Find $\mathcal{E}[Y \mid X = x]$.

(d) Find cov [X, Y].
28 The discrete density of X is given by $f_X(x) = x/3$ for $x = 1, 2$, and $f_{Y|X}(y|x)$ is
binomial with parameters x and 1/2; that is,

$$f_{Y|X}(y|x) = P[Y = y \mid X = x] = \binom{x}{y}\left(\tfrac{1}{2}\right)^{x}$$

for $y = 0, \ldots, x$ and $x = 1, 2$.

(a) Find $\mathcal{E}[X]$ and $\operatorname{var}[X]$.

(b) Find $\mathcal{E}[Y]$.

(c) Find the joint distribution of X and Y.

29 Let the joint density function of X and Y be given by $f_{X,Y}(x, y) = 8xy$ for
$0 < x < y < 1$ and be 0 elsewhere.

(a) Find $\mathcal{E}[Y \mid X = x]$.

(b) Find $\mathcal{E}[XY \mid X = x]$.

(c) Find $\operatorname{var}[Y \mid X = x]$.

30 Let Y be a random variable having a Poisson distribution with parameter $\lambda$.
Assume that the conditional distribution of X given Y = y is binomially distributed
with parameters y and p. Find the distribution of X, if X = 0 when Y = 0.

31 Assume that X and Y are independent random variables and X (Y) has binomial
distribution with parameters 3 and £ (2 and £). Find $P[X = Y]$.

32 Let X and Y have bivariate normal distribution with parameters $\mu_X = 5$, $\mu_Y = 10$,
$\sigma_X^2 = 1$, and $\sigma_Y^2 = 25$.

(a) If $\rho > 0$, find $\rho$ when $P[4 < Y < 16 \mid X = 5] = .954$.

*(b) If $\rho = 0$, find $P[X + Y \le 16]$.

33 Two dice are cast 10 times. Let X be the number of times no 1s appear, and let
Y be the number of times two 1s appear.

(a) What is the probability that X and Y will each be less than 3?

(b) What is the probability that X + Y will be 4?

34 Three coins are tossed n times.

(a) Find the joint density of X, the number of times no heads appear; Y, the number
of times one head appears; and Z, the number of times two heads appear.

(b) Find the conditional density of X and Z given Y.

35 Six cards are drawn without replacement from an ordinary deck.

(a) Find the joint density of the number of aces X and the number of kings Y.

(b) Find the conditional density of X given Y.

36 Let the two-dimensional random variable (X, Y) have the joint density

$$f_{X,Y}(x, y) = \tfrac{1}{8}(6 - x - y)\,I_{(0,2)}(x)\,I_{(2,4)}(y).$$

(a) Find $\mathcal{E}[Y \mid X = x]$.

(b) Find $\mathcal{E}[Y^2 \mid X = x]$.

(c) Find $\operatorname{var}[Y \mid X = x]$.

(d) Show that $\mathcal{E}[Y] = \mathcal{E}[\mathcal{E}[Y \mid X]]$.

(e) Find $\mathcal{E}[XY \mid X = x]$.

37 The trinomial distribution (multinomial with k + 1 = 3) of two random variables
X and Y is given by

$$f_{X,Y}(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\,p^{x} q^{y} (1 - p - q)^{n - x - y}$$

for $x, y = 0, 1, \ldots, n$ and $x + y \le n$, where $0 \le p$, $0 \le q$, and $p + q \le 1$.

(a) Find the marginal distribution of Y.

(b) Find the conditional distribution of X given Y, and obtain its expected
value.

(c) Find $\rho[X, Y]$.

38 Let (X, Y) have probability density function $f_{X,Y}(x, y)$, and let u(X) and v(Y) be
functions of X and Y, respectively. Show that

$$\mathcal{E}[u(X)v(Y) \mid X = x] = u(x)\,\mathcal{E}[v(Y) \mid X = x].$$

39 If X and Y are two random variables and $\mathcal{E}[Y \mid X = x] = \mu$, where $\mu$ does not
depend on x, show that $\operatorname{var}[Y] = \mathcal{E}[\operatorname{var}[Y \mid X]]$.

40 If X and Y are two independent random variables, does $\mathcal{E}[Y \mid X = x]$ depend
on x?

41 If the joint moment generating function of (X, Y) is given by
$m_{X,Y}(t_1, t_2) = \exp[\tfrac{1}{2}(t_1^2 + t_2^2)]$, what is the distribution of Y?

42 Define the moment generating function of $Y \mid X = x$. Does $m_Y(t) = \mathcal{E}[m_{Y|X}(t)]$?

43 Toss three coins. Let X denote the number of heads on the first two and Y
denote the number of heads on the last two.

(a) Find the joint distribution of X and Y.

(b) Find $\mathcal{E}[Y \mid X = 1]$.

(c) Find $\rho_{X,Y}$.

(d) Give a joint distribution that is not the joint distribution given in part (a)
yet has the same marginal distributions as the joint distribution given in
part (a).

44 Suppose that X and Y are jointly continuous random variables,
$f_{Y|X}(y|x) = I_{(x,\,x+1)}(y)$, and $f_X(x) = I_{(0,1)}(x)$.

(a) Find $\mathcal{E}[Y]$.

(b) Find cov[X, Y].

(c) Find $P[X + Y < 1]$.

(d) Find $f_{X|Y}(x|y)$.

45 Let (X, Y) have a joint discrete density function

$$f_{X,Y}(x, y) = p_1^{x}(1 - p_1)^{1-x}\,p_2^{y}(1 - p_2)^{1-y}\,[1 + \alpha(x - p_1)(y - p_2)]\,I_{\{0,1\}}(x)\,I_{\{0,1\}}(y),$$

where $0 < p_1 < 1$, $0 < p_2 < 1$, and $-1 \le \alpha \le 1$. Prove or disprove: X and Y
are independent if and only if they are uncorrelated.

*46 Let (X, Y) be jointly discrete random variables such that each of X and Y has at
most two mass points. Prove or disprove: X and Y are independent if and only
if they are uncorrelated.
DISTRIBUTIONS OF FUNCTIONS OF
RANDOM VARIABLES 


1 INTRODUCTION AND SUMMARY 

As the title of this chapter indicates, we are interested in finding the distributions
of functions of random variables. More precisely, for given random
variables, say $X_1, X_2, \ldots, X_n$, and given functions of the n given random variables,
say $g_1(\cdot, \ldots, \cdot), g_2(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$, we want, in general, to find the
joint distribution of $Y_1, Y_2, \ldots, Y_k$, where $Y_j = g_j(X_1, \ldots, X_n)$, $j = 1, 2, \ldots, k$.
If the joint density of the random variables $X_1, X_2, \ldots, X_n$ is given, then theoretically
at least, we can find the joint distribution of $Y_1, Y_2, \ldots, Y_k$. This
follows since the joint cumulative distribution function of $Y_1, \ldots, Y_k$ satisfies
the following:

$$F_{Y_1,\ldots,Y_k}(y_1, \ldots, y_k) = P[Y_1 \le y_1; \ldots; Y_k \le y_k]$$

$$= P[g_1(X_1, \ldots, X_n) \le y_1; \ldots; g_k(X_1, \ldots, X_n) \le y_k]$$

for fixed $y_1, \ldots, y_k$, which is the probability of an event described in terms of
$X_1, \ldots, X_n$, and theoretically such a probability can be determined by integrating
or summing the joint density over the region corresponding to the event.
The problem is that in general one cannot easily evaluate the desired probability
for each $y_1, \ldots, y_k$. One of the important problems of statistical inference, the
estimation of parameters, provides us with an example of a problem in which
it is useful to be able to find the distribution of a function of joint random
variables.

In this chapter three techniques for finding the distribution of functions of 
random variables will be presented. These three techniques are called (i) the 
cumulative-distribution-function technique, alluded to above and discussed in 
Sec. 3, (ii) the moment-generating-function technique, considered in Sec. 4, and 
(iii) the transformation technique, considered in Secs. 5 and 6. A number of 
important examples are given, including the distribution of sums of independent 
random variables (in Subsec. 4.2) and the distribution of the minimum and
maximum (in Subsec. 3.2). Presentation of other important derived distributions
is deferred until later chapters. For instance, the distributions of chi-square, 
Student’s t, and F, all derived from sampling from a normal distribution, are 
given in Sec. 4 of the next chapter. 

Preceding the presentation of the techniques for finding the distribution 
of functions of random variables is a discussion, given in Sec. 2, of expectations 
of functions of random variables. As one might suspect, an expectation, for 
example, the mean or the variance, of a function of given random variables can 
sometimes be expressed in terms of expectations of the given random variables. 
If such is the case and one is only interested in certain expectations, then it is not 
necessary to solve the problem of finding the distribution of the function of the 
given random variables. One important function of given random variables 
is their sum, and in Subsec. 2.2 the mean and variance of a sum of given random 
variables are derived. 

We have remarked several times in past chapters that our intermediate 
objective was the understanding of distribution theory. This chapter provides 
us with a presentation of distribution theory at a level that is deemed adequate 
for the understanding of the statistical concepts that are given in the remainder 
of this book. 


2 EXPECTATIONS OF FUNCTIONS 
OF RANDOM VARIABLES 

2.1 Expectation Two Ways 

An expectation of a function of a set of random variables can be obtained two
different ways. To illustrate, consider a function of just one random variable,
say X. Let $g(\cdot)$ be the function, and set $Y = g(X)$. Since Y is a random
variable, $\mathcal{E}[Y]$ is defined (if it exists), and $\mathcal{E}[g(X)]$ is defined (if it exists). For
instance, if X and $Y = g(X)$ are continuous random variables, then by definition

$$\mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy, \tag{1}$$

and

$$\mathcal{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx; \tag{2}$$

but $Y = g(X)$, so it seems reasonable that $\mathcal{E}[Y] = \mathcal{E}[g(X)]$. This can, in fact,
be proved, although we will not bother to do it. Thus we have two ways of
calculating the expectation of $Y = g(X)$; one is to average Y with respect to the
density of Y, and the other is to average $g(X)$ with respect to the density of X.

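In general, for given random variables $X_1, \ldots, X_n$, let $Y = g(X_1, \ldots, X_n)$;
then $\mathcal{E}[Y] = \mathcal{E}[g(X_1, \ldots, X_n)]$, where (for jointly continuous random variables)

$$\mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy \tag{3}$$

and

$$\mathcal{E}[g(X_1, \ldots, X_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n) f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \tag{4}$$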

In practice, one would naturally select that method which makes the 
calculations easier. One might suspect that Eq. (3) gives the better method of 
the two since it involves only a single integral whereas Eq. (4) involves a multiple 
integral. On the other hand, Eq. (3) involves the density of Y, a density that 
may have to be obtained before integration can proceed. 


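EXAMPLE 1 Let X be a standard normal random variable, and let $g(x) = x^2$.
For $Y = g(X) = X^2$,

$$\mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy,$$

and

$$\mathcal{E}[g(X)] = \mathcal{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx.$$

Now

$$\mathcal{E}[X^2] = \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = 1$$

and

$$\mathcal{E}[Y] = \int_{0}^{\infty} y\,\frac{(1/2)^{1/2}}{\Gamma(1/2)}\,y^{-1/2} e^{-y/2}\,dy = 1,$$

using the fact that Y has a gamma distribution with parameters $r = 1/2$ and
$\lambda = 1/2$. (See Example 2 in Subsec. 3.1 below.) ////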
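
The two routes to the expectation can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a plain midpoint rule (the helper `midpoint_integral` and the truncation limits are illustrative choices) and integrates $x^2$ against the standard normal density and $y$ against the gamma density with $r = 1/2$, $\lambda = 1/2$.

```python
import math

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f_X(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_Y(y):
    # Gamma density with r = 1/2 and lambda = 1/2 (the density of Y = X^2).
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y)

# Average g(X) = X^2 against the density of X (range truncated to [-10, 10]).
mean_via_x = midpoint_integral(lambda x: x * x * f_X(x), -10.0, 10.0)
# Average Y against the density of Y (range truncated to [0, 100]).
mean_via_y = midpoint_integral(lambda y: y * f_Y(y), 0.0, 100.0)

print(mean_via_x, mean_via_y)  # both should be close to 1
```

The second integral needs the density of Y before it can be evaluated, which is the practical cost of using Eq. (3) noted above.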


2.2 Sums of Random Variables 

A simple, yet important, function of several random variables is their sum. 


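Theorem 1 For random variables $X_1, \ldots, X_n$

$$\mathcal{E}\left[\sum_{1}^{n} X_i\right] = \sum_{1}^{n} \mathcal{E}[X_i], \tag{5}$$

and

$$\operatorname{var}\left[\sum_{1}^{n} X_i\right] = \sum_{1}^{n} \operatorname{var}[X_i] + 2\sum\sum_{i<j} \operatorname{cov}[X_i, X_j]. \tag{6}$$

proof That $\mathcal{E}\left[\sum_{1}^{n} X_i\right] = \sum_{1}^{n} \mathcal{E}[X_i]$ follows from a property of expectation
(see the last Remark in Subsec. 4.1 of Chap. IV).

$$\operatorname{var}\left[\sum_{1}^{n} X_i\right] = \mathcal{E}\left[\left(\sum_{1}^{n} X_i - \mathcal{E}\left[\sum_{1}^{n} X_i\right]\right)^2\right] = \mathcal{E}\left[\left(\sum_{1}^{n} (X_i - \mathcal{E}[X_i])\right)^2\right]$$

$$= \mathcal{E}\left[\sum_{i=1}^{n}\sum_{j=1}^{n} (X_i - \mathcal{E}[X_i])(X_j - \mathcal{E}[X_j])\right]$$

$$= \sum_{i=1}^{n}\sum_{j=1}^{n} \mathcal{E}[(X_i - \mathcal{E}[X_i])(X_j - \mathcal{E}[X_j])]$$

$$= \sum_{i=1}^{n} \operatorname{var}[X_i] + 2\sum\sum_{i<j} \operatorname{cov}[X_i, X_j].$$

////

Corollary If $X_1, \ldots, X_n$ are uncorrelated random variables, then

$$\operatorname{var}\left[\sum_{1}^{n} X_i\right] = \sum_{1}^{n} \operatorname{var}[X_i].$$

////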
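
Equation (6) can be verified exactly on a small discrete example. The sketch below is an illustration only: it assumes three fair coin tosses, with X the number of heads on the first two tosses and Y the number on the last two (so X and Y share the middle toss and are correlated), and uses exact rational arithmetic. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: three fair coin tosses, each outcome equally likely.
outcomes = list(product((0, 1), repeat=3))
prob = Fraction(1, len(outcomes))

def expect(h):
    """Expectation of the random variable h over the sample space."""
    return sum(prob * h(w) for w in outcomes)

def var(h):
    m = expect(h)
    return expect(lambda w: (h(w) - m) ** 2)

def cov(h1, h2):
    m1, m2 = expect(h1), expect(h2)
    return expect(lambda w: (h1(w) - m1) * (h2(w) - m2))

X = lambda w: w[0] + w[1]  # heads on the first two tosses
Y = lambda w: w[1] + w[2]  # heads on the last two tosses

lhs = var(lambda w: X(w) + Y(w))       # variance of the sum, computed directly
rhs = var(X) + var(Y) + 2 * cov(X, Y)  # right-hand side of Eq. (6) with n = 2
print(lhs, rhs)
```

Dropping the covariance term would understate the variance here, since the shared toss makes cov[X, Y] positive.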

The following theorem gives a result that is somewhat related to the above 
theorem inasmuch as its proof, which is left as an exercise, is similar. 
Theorem 2 Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be two sets of random variables,
and let $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$ be two sets of constants; then

$$\operatorname{cov}\left[\sum_{1}^{n} a_i X_i, \sum_{1}^{m} b_j Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} a_i b_j \operatorname{cov}[X_i, Y_j]. \tag{7}$$

////


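Corollary If $X_1, \ldots, X_n$ are random variables and $a_1, \ldots, a_n$ are
constants, then

$$\operatorname{var}\left[\sum_{1}^{n} a_i X_i\right] = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \operatorname{cov}[X_i, X_j] = \sum_{i=1}^{n} a_i^2 \operatorname{var}[X_i] + \sum\sum_{i \ne j} a_i a_j \operatorname{cov}[X_i, X_j]. \tag{8}$$

In particular, if $X_1, \ldots, X_n$ are independent and identically distributed
random variables with mean $\mu_X$ and variance $\sigma_X^2$ and if $\bar{X}_n = (1/n)\sum_{1}^{n} X_i$,
then

$$\mathcal{E}[\bar{X}_n] = \mu_X \quad \text{and} \quad \operatorname{var}[\bar{X}_n] = \frac{\sigma_X^2}{n}. \tag{9}$$

proof Let $m = n$, $Y_i = X_i$, and $b_i = a_i$, $i = 1, \ldots, n$ in the above
theorem; then

$$\operatorname{var}\left[\sum_{1}^{n} a_i X_i\right] = \operatorname{cov}\left[\sum_{1}^{n} a_i X_i, \sum_{1}^{m} b_j Y_j\right],$$

and Eq. (8) follows from Eq. (7). To obtain the variance part of Eq. (9)
from Eq. (8), set $a_i = 1/n$ and $\sigma_X^2 = \operatorname{var}[X_i]$. The mean part of Eq. (9)
is routinely derived as

$$\mathcal{E}[\bar{X}_n] = \mathcal{E}\left[\frac{1}{n}\sum_{1}^{n} X_i\right] = \frac{1}{n}\sum_{1}^{n} \mathcal{E}[X_i] = \frac{1}{n}\sum_{1}^{n} \mu_X = \mu_X.$$

////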

Corollary If $X_1$ and $X_2$ are two random variables, then

$$\operatorname{var}[X_1 \pm X_2] = \operatorname{var}[X_1] + \operatorname{var}[X_2] \pm 2\operatorname{cov}[X_1, X_2]. \tag{10}$$

////

Equation (10) gives the variance of the sum or the difference of two random
variables. Clearly

$$\mathcal{E}[X_1 \pm X_2] = \mathcal{E}[X_1] \pm \mathcal{E}[X_2]. \tag{11}$$
2.3 Product and Quotient

In the above subsection the mean and variance of the sum and difference of two 
random variables were obtained. It was found that the mean and variance of 
the sum or difference of random variables X and Y could be expressed in terms 
of the means, variances, and covariance of X and Y. We consider now the 
problem of finding the first two moments of the product and quotient of X and Y. 

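Theorem 3 Let X and Y be two random variables for which $\operatorname{var}[XY]$
exists; then

$$\mathcal{E}[XY] = \mu_X \mu_Y + \operatorname{cov}[X, Y], \tag{12}$$

and

$$\operatorname{var}[XY] = \mu_Y^2 \operatorname{var}[X] + \mu_X^2 \operatorname{var}[Y] + 2\mu_X \mu_Y \operatorname{cov}[X, Y] - (\operatorname{cov}[X, Y])^2$$
$$\quad + \mathcal{E}[(X - \mu_X)^2 (Y - \mu_Y)^2] + 2\mu_Y \mathcal{E}[(X - \mu_X)^2 (Y - \mu_Y)] + 2\mu_X \mathcal{E}[(X - \mu_X)(Y - \mu_Y)^2]. \tag{13}$$

PROOF

$$XY = \mu_X \mu_Y + (X - \mu_X)\mu_Y + (Y - \mu_Y)\mu_X + (X - \mu_X)(Y - \mu_Y).$$

Calculate $\mathcal{E}[XY]$ and $\mathcal{E}[(XY)^2]$ to get the desired results. ////

Corollary If X and Y are independent, $\mathcal{E}[XY] = \mu_X \mu_Y$, and
$\operatorname{var}[XY] = \mu_Y^2 \operatorname{var}[X] + \mu_X^2 \operatorname{var}[Y] + \operatorname{var}[X]\operatorname{var}[Y]$.

proof If X and Y are independent,

$$\mathcal{E}[(X - \mu_X)^2 (Y - \mu_Y)^2] = \mathcal{E}[(X - \mu_X)^2]\,\mathcal{E}[(Y - \mu_Y)^2] = \operatorname{var}[X]\operatorname{var}[Y],$$

$$\mathcal{E}[(X - \mu_X)^2 (Y - \mu_Y)] = \mathcal{E}[(X - \mu_X)^2]\,\mathcal{E}[Y - \mu_Y] = 0,$$

and

$$\mathcal{E}[(X - \mu_X)(Y - \mu_Y)^2] = 0.$$

////

Note that the mean of the product can be expressed in terms of the means
and covariance of X and Y but the variance of the product requires higher-order
moments.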
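
As a check on Eqs. (12) and (13), both sides can be computed exactly for a small dependent pair. This is a sketch under assumptions: the joint density `joint` below is a made-up four-point distribution chosen only for illustration, and exact fractions are used so that the two sides can be compared for equality.

```python
from fractions import Fraction

# Hypothetical joint discrete density of (X, Y); the two variables are dependent.
joint = {
    (0, 1): Fraction(1, 4),
    (1, 2): Fraction(1, 4),
    (2, 2): Fraction(1, 4),
    (3, 5): Fraction(1, 4),
}

def expect(h):
    """Expectation of h(X, Y) under the joint density."""
    return sum(p * h(x, y) for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Left-hand sides, computed directly from the joint density.
mean_xy = expect(lambda x, y: x * y)
var_xy = expect(lambda x, y: (x * y - mean_xy) ** 2)

# Right-hand sides of Eqs. (12) and (13).
rhs_mean = mu_x * mu_y + cov_xy
rhs_var = (
    mu_y ** 2 * var_x + mu_x ** 2 * var_y + 2 * mu_x * mu_y * cov_xy
    - cov_xy ** 2
    + expect(lambda x, y: (x - mu_x) ** 2 * (y - mu_y) ** 2)
    + 2 * mu_y * expect(lambda x, y: (x - mu_x) ** 2 * (y - mu_y))
    + 2 * mu_x * expect(lambda x, y: (x - mu_x) * (y - mu_y) ** 2)
)

print(mean_xy == rhs_mean, var_xy == rhs_var)
```

The three higher-order expectations in `rhs_var` are the terms that cannot be recovered from means, variances, and the covariance alone.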

In general, there are no simple exact formulas for the mean and variance 
of the quotient of two random variables in terms of moments of the two random 
variables; however, there are approximate formulas which are sometimes useful. 



Theorem 4

$$\mathcal{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y} - \frac{1}{\mu_Y^2}\operatorname{cov}[X, Y] + \frac{\mu_X}{\mu_Y^3}\operatorname{var}[Y], \tag{14}$$

and

$$\operatorname{var}\left[\frac{X}{Y}\right] \approx \left(\frac{\mu_X}{\mu_Y}\right)^2 \left(\frac{\operatorname{var}[X]}{\mu_X^2} + \frac{\operatorname{var}[Y]}{\mu_Y^2} - \frac{2\operatorname{cov}[X, Y]}{\mu_X \mu_Y}\right). \tag{15}$$

proof To find the approximate formula for $\mathcal{E}[X/Y]$, consider the
Taylor series expansion of $x/y$ expanded about $(\mu_X, \mu_Y)$; drop all terms of
order higher than 2, and then take the expectation of both sides. The
approximate formula for $\operatorname{var}[X/Y]$ is similarly obtained by expanding in
a Taylor series and retaining only second-order terms. ////


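Two comments are in order: First, it is not unusual that the mean and
variance of the quotient X/Y do not exist even though the moments of X and Y
do exist. (See Examples 5, 23, and 24.) Second, the method of proof of
Theorem 4 can be used to find approximate formulas for the mean and variance
of functions of X and Y other than the quotient. For example,

$$\mathcal{E}[g(X, Y)] \approx g(\mu_X, \mu_Y) + \frac{1}{2}\operatorname{var}[X]\,\frac{\partial^2}{\partial x^2} g(x, y)\Big|_{\mu_X, \mu_Y} + \frac{1}{2}\operatorname{var}[Y]\,\frac{\partial^2}{\partial y^2} g(x, y)\Big|_{\mu_X, \mu_Y} + \operatorname{cov}[X, Y]\,\frac{\partial^2}{\partial y\,\partial x} g(x, y)\Big|_{\mu_X, \mu_Y}, \tag{16}$$

and

$$\operatorname{var}[g(X, Y)] \approx \operatorname{var}[X]\left(\frac{\partial}{\partial x} g(x, y)\Big|_{\mu_X, \mu_Y}\right)^2 + \operatorname{var}[Y]\left(\frac{\partial}{\partial y} g(x, y)\Big|_{\mu_X, \mu_Y}\right)^2 + 2\operatorname{cov}[X, Y]\left(\frac{\partial}{\partial x} g(x, y)\,\frac{\partial}{\partial y} g(x, y)\right)\Big|_{\mu_X, \mu_Y}. \tag{17}$$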
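
To see how the general formulas (16) and (17) specialize to the quotient, the partial derivatives can be approximated numerically and the results compared with Eqs. (14) and (15). The following is only a sketch: the central-difference helpers `approx_mean` and `approx_var`, the step size `h`, and the moment values are all assumptions made for illustration.

```python
def approx_mean(g, mx, my, vx, vy, cxy, h=1e-3):
    """Right-hand side of Eq. (16), with central-difference second derivatives."""
    g0 = g(mx, my)
    gxx = (g(mx + h, my) - 2.0 * g0 + g(mx - h, my)) / h**2
    gyy = (g(mx, my + h) - 2.0 * g0 + g(mx, my - h)) / h**2
    gxy = (g(mx + h, my + h) - g(mx + h, my - h)
           - g(mx - h, my + h) + g(mx - h, my - h)) / (4.0 * h**2)
    return g0 + 0.5 * vx * gxx + 0.5 * vy * gyy + cxy * gxy

def approx_var(g, mx, my, vx, vy, cxy, h=1e-3):
    """Right-hand side of Eq. (17), with central-difference first derivatives."""
    gx = (g(mx + h, my) - g(mx - h, my)) / (2.0 * h)
    gy = (g(mx, my + h) - g(mx, my - h)) / (2.0 * h)
    return vx * gx**2 + vy * gy**2 + 2.0 * cxy * gx * gy

# Illustrative moments: means, variances, and covariance of X and Y.
mx, my, vx, vy, cxy = 2.0, 5.0, 0.3, 0.4, 0.1

def quotient(x, y):
    return x / y

mean_16 = approx_mean(quotient, mx, my, vx, vy, cxy)
var_17 = approx_var(quotient, mx, my, vx, vy, cxy)

# Closed forms from Theorem 4, Eqs. (14) and (15).
mean_14 = mx / my - cxy / my**2 + mx * vy / my**3
var_15 = (mx / my) ** 2 * (vx / mx**2 + vy / my**2 - 2.0 * cxy / (mx * my))

print(mean_16, mean_14)
print(var_17, var_15)
```

Passing a different function in place of `quotient` gives the corresponding approximations for other functions of X and Y; how good they are depends on how nearly quadratic the function is near the means.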


3 CUMULATIVE-DISTRIBUTION-FUNCTION 
TECHNIQUE 


3.1 Description of Technique 

If the joint distribution of random variables $X_1, \ldots, X_n$ is given, then, theoretically,
the joint distribution of random variables $Y_1, \ldots, Y_k$ can be determined,
where $Y_j = g_j(X_1, \ldots, X_n)$, $j = 1, \ldots, k$ for given functions $g_1(\cdot, \ldots, \cdot), \ldots,$
$g_k(\cdot, \ldots, \cdot)$. By definition, the joint cumulative distribution function of
$Y_1, \ldots, Y_k$ is $F_{Y_1,\ldots,Y_k}(y_1, \ldots, y_k) = P[Y_1 \le y_1; \ldots; Y_k \le y_k]$. But for each
$y_1, \ldots, y_k$ the event $\{Y_1 \le y_1; \ldots; Y_k \le y_k\} = \{g_1(X_1, \ldots, X_n) \le y_1; \ldots;$
$g_k(X_1, \ldots, X_n) \le y_k\}$. This latter event is an event described in terms of the
given functions $g_1(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$ and the given random variables
$X_1, \ldots, X_n$. Since the joint distribution of $X_1, \ldots, X_n$ is assumed given, presumably
the probability of event $\{g_1(X_1, \ldots, X_n) \le y_1; \ldots; g_k(X_1, \ldots, X_n) \le y_k\}$
can be calculated and consequently $F_{Y_1,\ldots,Y_k}(\cdot, \ldots, \cdot)$ determined. The
above-described technique for deriving the joint distribution of $Y_1, \ldots, Y_k$ will be
called the cumulative-distribution-function technique.

An important special case arises if $k = 1$; then there is only one function,
say $g(X_1, \ldots, X_n)$, of the given random variables for which one needs to derive
the distribution.


EXAMPLE 2 Let there be only one given random variable, say X, which has a
standard normal distribution. Suppose the distribution of $Y = g(X) = X^2$
is desired.

$$F_Y(y) = P[Y \le y] = P[X^2 \le y] = P[-\sqrt{y} \le X \le \sqrt{y}] = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$$

$$= 2\int_{0}^{\sqrt{y}} \phi(u)\,du = 2\int_{0}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du$$

$$= \frac{2}{\sqrt{2\pi}} \int_{0}^{y} \frac{1}{2\sqrt{z}} e^{-z/2}\,dz = \int_{0}^{y} \frac{1}{\Gamma(1/2)} \frac{1}{\sqrt{2z}} e^{-z/2}\,dz, \quad \text{for } y > 0,$$

which can be recognized as the cumulative distribution function of a
gamma distribution with parameters $r = 1/2$ and $\lambda = 1/2$. ////


Other applications of the cumulative-distribution-function technique
expounded above are given in the following three subsections.
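
The result of Example 2 can be compared with simulation. The sketch below is illustrative only: it assumes a fixed seed and 100,000 pseudorandom standard normal draws, evaluates $F_Y(y) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$ through the error function, and compares it with the empirical fraction of squared draws not exceeding y.

```python
import math
import random

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F_Y(y):
    """cdf of Y = X^2 obtained by the cumulative-distribution-function technique."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    return Phi(root) - Phi(-root)

rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]

def empirical_cdf(y):
    """Fraction of simulated values of X^2 that are at most y."""
    return sum(d <= y for d in draws) / len(draws)

for y in (0.5, 1.0, 2.0, 4.0):
    print(y, round(F_Y(y), 4), round(empirical_cdf(y), 4))
```

The two columns should agree to about two decimal places; the residual gap is sampling error, not a defect of the formula.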


3.2 Distribution of Minimum and Maximum 

Let $X_1, \ldots, X_n$ be n given random variables. Define $Y_1 = \min[X_1, \ldots, X_n]$
and $Y_n = \max[X_1, \ldots, X_n]$. To be certain to understand the meaning of
$Y_n = \max[X_1, \ldots, X_n]$, recall that each $X_i$ is a function with domain $\Omega$, the
sample space of a random experiment. For each $\omega \in \Omega$, $X_i(\omega)$ is some real
number. Now $Y_n$ is to be a random variable; that is, for each $\omega$, $Y_n(\omega)$ is to be
some real number. As defined, $Y_n(\omega) = \max[X_1(\omega), \ldots, X_n(\omega)]$; that is, for
a given $\omega$, $Y_n(\omega)$ is the largest of the real numbers $X_1(\omega), \ldots, X_n(\omega)$.

The distributions of $Y_1$ and $Y_n$ are desired. $F_{Y_n}(y) = P[Y_n \le y] =$
$P[X_1 \le y; \ldots; X_n \le y]$ since the largest of the $X_i$'s is less than or equal to y if
and only if all the $X_i$'s are less than or equal to y. Now, if the $X_i$'s are assumed
independent, then

$$P[X_1 \le y; \ldots; X_n \le y] = \prod_{i=1}^{n} P[X_i \le y] = \prod_{i=1}^{n} F_{X_i}(y),$$

so the distribution of $Y_n = \max[X_1, \ldots, X_n]$ can be expressed in terms of the
marginal distributions of $X_1, \ldots, X_n$. If in addition it is assumed that all the
$X_1, \ldots, X_n$ have the same cumulative distribution, say $F_X(\cdot)$, then

$$\prod_{i=1}^{n} F_{X_i}(y) = [F_X(y)]^n.$$

We have proved Theorem 5.


Theorem 5 If $X_1, \ldots, X_n$ are independent random variables and
$Y_n = \max[X_1, \ldots, X_n]$, then

$$F_{Y_n}(y) = \prod_{i=1}^{n} F_{X_i}(y). \tag{18}$$

If $X_1, \ldots, X_n$ are independent and identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$F_{Y_n}(y) = [F_X(y)]^n. \tag{19}$$

////

Corollary If $X_1, \ldots, X_n$ are independent identically distributed continuous
random variables with common probability density function $f_X(\cdot)$
and cumulative distribution function $F_X(\cdot)$, then

$$f_{Y_n}(y) = n[F_X(y)]^{n-1} f_X(y). \tag{20}$$

PROOF

$$f_{Y_n}(y) = \frac{d}{dy} F_{Y_n}(y) = n[F_X(y)]^{n-1} f_X(y).$$

////

Similarly,

$$F_{Y_1}(y) = P[Y_1 \le y] = 1 - P[Y_1 > y] = 1 - P[X_1 > y; \ldots; X_n > y]$$

since $Y_1$ is greater than y if and only if every $X_i > y$. And if $X_1, \ldots, X_n$ are
independent, then

$$1 - P[X_1 > y; \ldots; X_n > y] = 1 - \prod_{i=1}^{n} P[X_i > y] = 1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)].$$

If further it is assumed that $X_1, \ldots, X_n$ are identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)] = 1 - [1 - F_X(y)]^n,$$

and we have proved Theorem 6.

Theorem 6 If $X_1, \ldots, X_n$ are independent random variables and
$Y_1 = \min[X_1, \ldots, X_n]$, then

$$F_{Y_1}(y) = 1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)]. \tag{21}$$

And if $X_1, \ldots, X_n$ are independent and identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$F_{Y_1}(y) = 1 - [1 - F_X(y)]^n. \tag{22}$$

////

Corollary If $X_1, \ldots, X_n$ are independent identically distributed continuous
random variables with common probability density $f_X(\cdot)$ and
cumulative distribution $F_X(\cdot)$, then

$$f_{Y_1}(y) = n[1 - F_X(y)]^{n-1} f_X(y). \tag{23}$$

PROOF

$$f_{Y_1}(y) = \frac{d}{dy} F_{Y_1}(y) = n[1 - F_X(y)]^{n-1} f_X(y).$$

////

EXAMPLE 3 Suppose that the life of a certain light bulb is exponentially
distributed with mean 100 hours. If 10 such light bulbs are installed
simultaneously, what is the distribution of the life of the light bulb that
fails first, and what is its expected life? Let $X_i$ denote the life of the ith
light bulb; then $Y_1 = \min[X_1, \ldots, X_{10}]$ is the life of the light bulb that
fails first. Assume that the $X_i$'s are independent.
Now $f_{X_i}(x) = \frac{1}{100} e^{-x/100} I_{(0,\infty)}(x)$, and

$$F_{X_i}(x) = (1 - e^{-x/100})\,I_{(0,\infty)}(x);$$

so

$$f_{Y_1}(y) = 10\,(e^{-y/100})^{10-1} \left(\tfrac{1}{100} e^{-y/100}\right) I_{(0,\infty)}(y) = \tfrac{1}{10} e^{-y/10} I_{(0,\infty)}(y),$$

which is an exponential distribution with parameter $\lambda = 1/10$; hence
$\mathcal{E}[Y_1] = 1/\lambda = 10$. ////
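
Example 3 can be reproduced in a few lines. The following sketch is not from the text: the helper names `cdf_min` and `exp_cdf`, the seed, and the number of simulated installations are assumptions. It evaluates Eq. (21) for ten exponential lifetimes with mean 100, compares the result with an exponential cumulative distribution function with mean 10, and then simulates the time to first failure.

```python
import math
import random

def cdf_min(y, cdfs):
    """Eq. (21): cdf of the minimum of independent random variables."""
    survival = 1.0
    for F in cdfs:
        survival *= 1.0 - F(y)
    return 1.0 - survival

def exp_cdf(mean):
    """cdf of an exponential distribution with the given mean."""
    return lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0

bulbs = [exp_cdf(100.0)] * 10   # ten bulbs, each with mean life 100 hours
first_failure = exp_cdf(10.0)   # claimed distribution of the first failure

# Largest discrepancy between Eq. (21) and the exponential(mean 10) cdf.
max_gap = max(abs(cdf_min(y, bulbs) - first_failure(y)) for y in range(0, 101, 5))

# Simulated time to first failure over many independent installations.
rng = random.Random(1)
sims = [min(rng.expovariate(1.0 / 100.0) for _ in range(10)) for _ in range(20_000)]
sim_mean = sum(sims) / len(sims)

print(max_gap, sim_mean)  # the gap is rounding error; the mean is near 10
```

Because `cdf_min` takes a list of cumulative distribution functions, it also covers the case of Eq. (21) in which the lifetimes are independent but not identically distributed.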


3.3 Distribution of Sum and Difference of Two Random Variables 

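Theorem 7 Let X and Y be jointly distributed continuous random
variables with density $f_{X,Y}(x, y)$, and let $Z = X + Y$ and $V = X - Y$.
Then,

$$f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx = \int_{-\infty}^{\infty} f_{X,Y}(z - y, y)\,dy, \tag{24}$$

and

$$f_V(v) = \int_{-\infty}^{\infty} f_{X,Y}(x, x - v)\,dx = \int_{-\infty}^{\infty} f_{X,Y}(v + y, y)\,dy. \tag{25}$$

proof We will prove only the first part of Eq. (24); the others are
proved in an analogous manner.

$$F_Z(z) = P[Z \le z] = P[X + Y \le z] = \iint_{x + y \le z} f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{z - x} f_{X,Y}(x, y)\,dy\right] dx$$

$$= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{z} f_{X,Y}(x, u - x)\,du\right] dx$$

by making the substitution $y = u - x$.

Now

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{d}{dz}\left\{\int_{-\infty}^{\infty} \left[\int_{-\infty}^{z} f_{X,Y}(x, u - x)\,du\right] dx\right\}$$

$$= \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx.$$

////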
Corollary If X and Y are independent continuous random variables
and $Z = X + Y$, then

$$f_Z(z) = f_{X+Y}(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\,dy. \tag{26}$$

proof Equation (26) follows immediately from independence and
Eq. (24); however, we will give a direct proof using a conditional distribution
formula. [See Eq. (11) of Chap. IV.]

$$P[Z \le z] = P[X + Y \le z] = \int_{-\infty}^{\infty} P[X + Y \le z \mid X = x] f_X(x)\,dx$$

$$= \int_{-\infty}^{\infty} P[x + Y \le z] f_X(x)\,dx$$

$$= \int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\,dx.$$

Hence,

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{d}{dz}\left[\int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\,dx\right] = \int_{-\infty}^{\infty} \frac{dF_Y(z - x)}{dz} f_X(x)\,dx$$

$$= \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx.$$

////


Remark The formula given in Eq. (26) is often called the convolution
formula. In mathematical analysis, the function $f_Z(\cdot)$ is called the
convolution of the functions $f_Y(\cdot)$ and $f_X(\cdot)$. ////


EXAMPLE 4 Suppose that X and Y are independent and identically distributed
with density $f_X(x) = f_Y(x) = I_{(0,1)}(x)$. Note that since both X
and Y assume values between 0 and 1, $Z = X + Y$ assumes values between
0 and 2.

$$f_Z(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx = \int_{-\infty}^{\infty} I_{(0,1)}(z - x)\,I_{(0,1)}(x)\,dx$$

$$= \int_{-\infty}^{\infty} \{I_{(0,z)}(x)\,I_{(0,1)}(z) + I_{(z-1,1)}(x)\,I_{[1,2)}(z)\}\,dx$$

$$= I_{(0,1)}(z) \int_{0}^{z} dx + I_{[1,2)}(z) \int_{z-1}^{1} dx$$

$$= z\,I_{(0,1)}(z) + (2 - z)\,I_{[1,2)}(z).$$

////
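
The convolution formula of Eq. (26) can also be evaluated numerically, which is useful when the integral has no convenient closed form. The sketch below is illustrative: the helper `convolve_at` and its midpoint rule are assumptions, applied here to the two uniform densities of Example 4 and compared with the triangular density just derived.

```python
def uniform_density(x):
    """Density of a uniform distribution over the interval (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def convolve_at(z, f_x, f_y, a=0.0, b=1.0, n=20_000):
    """Midpoint-rule value of Eq. (26) at z, assuming f_x vanishes outside (a, b)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += f_y(z - x) * f_x(x)
    return h * total

def triangular(z):
    """Closed form of f_Z from Example 4."""
    if 0.0 < z < 1.0:
        return z
    if 1.0 <= z < 2.0:
        return 2.0 - z
    return 0.0

for z in (0.25, 0.5, 1.0, 1.5, 1.9):
    print(z, round(convolve_at(z, uniform_density, uniform_density), 4), triangular(z))
```

Replacing `uniform_density` by another pair of densities (and adjusting the integration limits) gives the density of the corresponding sum of independent random variables.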
FIGURE 1


3.4 Distribution of Product and Quotient 

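Theorem 8 Let X and Y be jointly distributed continuous random variables
with density $f_{X,Y}(x, y)$, and let $Z = XY$ and $U = X/Y$; then

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|} f_{X,Y}\left(x, \frac{z}{x}\right) dx = \int_{-\infty}^{\infty} \frac{1}{|y|} f_{X,Y}\left(\frac{z}{y}, y\right) dy \tag{27}$$

and

$$f_U(u) = \int_{-\infty}^{\infty} |y|\,f_{X,Y}(uy, y)\,dy. \tag{28}$$

PROOF Again, only the first part of Eq. (27) will be proved. (See
Fig. 1 for z > 0.)

$$F_Z(z) = P[Z \le z] = \iint_{xy \le z} f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{0} \left[\int_{z/x}^{\infty} f_{X,Y}(x, y)\,dy\right] dx + \int_{0}^{\infty} \left[\int_{-\infty}^{z/x} f_{X,Y}(x, y)\,dy\right] dx,$$

which on making the substitution $u = xy$

$$= \int_{-\infty}^{0} \left[\int_{z}^{-\infty} f_{X,Y}\left(x, \frac{u}{x}\right) \frac{du}{x}\right] dx + \int_{0}^{\infty} \left[\int_{-\infty}^{z} f_{X,Y}\left(x, \frac{u}{x}\right) \frac{du}{x}\right] dx;$$

hence

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{0} \left(-\frac{1}{x}\right) f_{X,Y}\left(x, \frac{z}{x}\right) dx + \int_{0}^{\infty} \frac{1}{x} f_{X,Y}\left(x, \frac{z}{x}\right) dx = \int_{-\infty}^{\infty} \frac{1}{|x|} f_{X,Y}\left(x, \frac{z}{x}\right) dx.$$

////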


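EXAMPLE 5 Suppose X and Y are independent random variables, each
uniformly distributed over the interval (0, 1). Let $Z = XY$ and $U = X/Y$.

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|} f_{X,Y}\left(x, \frac{z}{x}\right) dx$$

$$= I_{(0,1)}(z) \int_{0}^{1} \frac{1}{x} I_{(z,1)}(x)\,dx$$

$$= I_{(0,1)}(z) \int_{z}^{1} \frac{1}{x}\,dx = -\log z\; I_{(0,1)}(z).$$

$$f_U(u) = \int_{-\infty}^{\infty} |y|\,f_{X,Y}(uy, y)\,dy$$

$$= \int_{-\infty}^{\infty} |y|\,I_{(0,1)}(uy)\,I_{(0,1)}(y)\,dy \quad \text{(see Fig. 2)}$$

$$= \int_{-\infty}^{\infty} |y|\,\{I_{(0,1)}(u)\,I_{(0,1)}(y) + I_{[1,\infty)}(u)\,I_{(0,1/u)}(y)\}\,dy$$

$$= I_{(0,1)}(u) \int_{0}^{1} y\,dy + I_{[1,\infty)}(u) \int_{0}^{1/u} y\,dy$$

$$= \frac{1}{2} I_{(0,1)}(u) + \frac{1}{2}\left(\frac{1}{u}\right)^2 I_{[1,\infty)}(u).$$

Note that $\mathcal{E}[X/Y] = \mathcal{E}[U] = \frac{1}{2}\int_{0}^{1} u\,du + \frac{1}{2}\int_{1}^{\infty} (1/u)\,du = \infty$, quite
different from $\mathcal{E}[X]/\mathcal{E}[Y] = 1$. ////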
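
The densities found in Example 5 integrate to the cumulative distribution functions $F_Z(z) = z - z\log z$ for $0 < z < 1$, and $F_U(u) = u/2$ for $0 < u < 1$ with $F_U(u) = 1 - 1/(2u)$ for $u \ge 1$. A rough simulation check follows; it is a sketch only, with the seed, the sample size, and the use of `1 - random()` to keep draws away from zero all being illustrative choices.

```python
import math
import random

rng = random.Random(2)
n = 200_000
# Independent uniform draws; 1 - random() lies in (0, 1], so Y is never zero.
pairs = [(1.0 - rng.random(), 1.0 - rng.random()) for _ in range(n)]

def F_Z(z):
    """cdf of Z = XY for 0 < z < 1, the integral of -log t from 0 to z."""
    return z - z * math.log(z)

def F_U(u):
    """cdf of U = X/Y for u > 0, the integral of the density f_U."""
    return 0.5 * u if u < 1.0 else 1.0 - 0.5 / u

def empirical(stat, t):
    """Fraction of simulated pairs whose statistic is at most t."""
    return sum(stat(x, y) <= t for x, y in pairs) / n

for z in (0.1, 0.5, 0.9):
    print("Z", z, round(F_Z(z), 4), round(empirical(lambda x, y: x * y, z), 4))
for u in (0.5, 1.0, 3.0):
    print("U", u, round(F_U(u), 4), round(empirical(lambda x, y: x / y, u), 4))
```

The sample mean of the simulated quotients is deliberately not reported: since $\mathcal{E}[U]$ is infinite, it does not settle down as the sample grows.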
FIGURE 2

4 MOMENT-GENERATING-FUNCTION TECHNIQUE 
4.1 Description of Technique 

There is another method of determining the distribution of functions of random 
variables which we shall find to be particularly useful in certain instances. This 
method is built around the concept of the moment generating function and will 
be called the moment-generating-function technique. 

The statement of the problem remains the same. For given random variables
$X_1, \ldots, X_n$ with given density $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ and given functions
$g_1(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$, find the joint distribution of $Y_1 = g_1(X_1, \ldots, X_n), \ldots,$
$Y_k = g_k(X_1, \ldots, X_n)$. Now the joint moment generating function of
$Y_1, \ldots, Y_k$, if it exists, is

$$m_{Y_1,\ldots,Y_k}(t_1, \ldots, t_k) = \mathcal{E}[e^{t_1 Y_1 + \cdots + t_k Y_k}]$$

$$= \int \cdots \int e^{t_1 g_1(x_1,\ldots,x_n) + \cdots + t_k g_k(x_1,\ldots,x_n)} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i. \tag{29}$$

If after the integration of Eq. (29) is performed, the resulting function of
$t_1, \ldots, t_k$ can be recognized as the joint moment generating function of some
known joint distribution, it will follow that $Y_1, \ldots, Y_k$ has that joint distribution
by virtue of the fact that a moment generating function, when it exists, is
unique and uniquely determines its distribution function.

For k > 1, this method will be of limited use to us because we can recog¬ 
nize only a few joint moment generating functions. For k = 1, the moment 
generating function is a function of a single argument, and we should have a 
better chance of recognizing the resulting moment generating function. 
This method is quite powerful in connection with certain techniques of
advanced mathematics (the theory of transforms) which, in many instances, 
enable one to determine the distribution associated with the derived moment 
generating function. 

The most useful application of the moment-generating-function technique 
will be given in Subsec. 4.2. There it will be used to find the distribution of 
sums of independent random variables. 


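EXAMPLE 6 Suppose X has a normal distribution with mean 0 and variance 1.
Let $Y = X^2$, and find the distribution of Y.

$$m_Y(t) = \mathcal{E}[e^{tY}] = \int_{-\infty}^{\infty} e^{tx^2} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$$

$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2(1 - 2t)}\,dx$$

$$= (1 - 2t)^{-1/2} \int_{-\infty}^{\infty} \frac{(1 - 2t)^{1/2}}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2(1 - 2t)}\,dx$$

$$= (1 - 2t)^{-1/2} = \left(\frac{1/2}{1/2 - t}\right)^{1/2} \quad \text{for } t < \frac{1}{2},$$

which we recognize as the moment generating function of a gamma with
parameters $r = 1/2$ and $\lambda = 1/2$. (It is also called a chi-square distribution
with one degree of freedom. See Subsec. 4.3 of Chap. VI.) ////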
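
The closed form $(1 - 2t)^{-1/2}$ can be compared with a direct numerical evaluation of $\mathcal{E}[e^{tX^2}]$. This is a sketch under stated assumptions: the function name `mgf_of_square`, the midpoint rule, and the truncation of the integral to $[-40, 40]$ are illustrative choices, and only values $t < 1/2$ are tried because the integral diverges otherwise.

```python
import math

def mgf_of_square(t, half_width=40.0, n=200_000):
    """Midpoint-rule value of E[exp(t X^2)] for standard normal X, t < 1/2."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        # Combine exp(t x^2) with the normal kernel exp(-x^2 / 2).
        total += math.exp((t - 0.5) * x * x)
    return h * total / math.sqrt(2.0 * math.pi)

def gamma_mgf(t, r=0.5, lam=0.5):
    """Moment generating function of a gamma distribution, valid for t < lam."""
    return (lam / (lam - t)) ** r

for t in (-1.0, 0.1, 0.25, 0.4):
    print(t, mgf_of_square(t), gamma_mgf(t))
```

Agreement at a handful of points does not prove the two functions are equal, but it is a quick guard against algebra slips when completing the square.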


EXAMPLE 7 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y_1 = g_1(X_1, X_2) = X_1 + X_2$ and $Y_2 = g_2(X_1, X_2) =$
$X_2 - X_1$. Find the joint distribution of $Y_1$ and $Y_2$.

$$m_{Y_1,Y_2}(t_1, t_2) = \mathcal{E}[e^{Y_1 t_1 + Y_2 t_2}]$$

$$= \mathcal{E}[e^{(X_1 + X_2)t_1 + (X_2 - X_1)t_2}] = \mathcal{E}[e^{X_1(t_1 - t_2) + X_2(t_1 + t_2)}]$$

$$= \mathcal{E}[e^{X_1(t_1 - t_2)}]\,\mathcal{E}[e^{X_2(t_1 + t_2)}] = m_{X_1}(t_1 - t_2)\,m_{X_2}(t_1 + t_2)$$

$$= \exp\frac{(t_1 - t_2)^2}{2}\,\exp\frac{(t_1 + t_2)^2}{2}$$

$$= \exp(t_1^2 + t_2^2) = \exp\frac{2t_1^2}{2}\,\exp\frac{2t_2^2}{2}$$

$$= m_{Y_1}(t_1)\,m_{Y_2}(t_2).$$

We note that $Y_1$ and $Y_2$ are independent random variables (by Theorem 10
of Chap. IV) and each has a normal distribution with mean 0 and variance
2. ////

In the above example we were able to manipulate expectations and avoid 
performing an integration to find the desired joint moment generating function. 
In the following example the integration will have to be performed. 


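EXAMPLE 8 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y = (X_2 - X_1)^2/2$, and find the distribution of Y.

$$m_Y(t) = \mathcal{E}[\exp Yt] = \mathcal{E}\left[\exp\frac{(X_2 - X_1)^2}{2}t\right]$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left[\frac{(x_2 - x_1)^2}{2}t - \frac{x_1^2 + x_2^2}{2}\right] dx_1\,dx_2$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left[x_1^2(1 - t) + 2x_1 x_2 t + x_2^2(1 - t)\right]\right\} dx_1\,dx_2$$

$$= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{x_2^2}{2}\,\frac{1 - 2t}{1 - t}\right] \left\{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1 - t}{2}\left(x_1 + \frac{t x_2}{1 - t}\right)^2\right] dx_1\right\} dx_2$$

$$= \frac{1}{\sqrt{1 - t}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{x_2^2}{2}\,\frac{1 - 2t}{1 - t}\right] dx_2$$

$$= \frac{1}{\sqrt{1 - t}}\,\sqrt{\frac{1 - t}{1 - 2t}} = (1 - 2t)^{-1/2} \quad \text{for } t < 1/2,$$

which is the moment generating function of a gamma distribution with
parameters $r = 1/2$ and $\lambda = 1/2$; hence,

$$f_Y(y) = \frac{(1/2)^{1/2}}{\Gamma(1/2)}\,y^{-1/2} e^{-y/2}\,I_{(0,\infty)}(y).$$

////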
4.2 Distribution of Sums of Independent Random Variables

In this subsection we employ the moment-generating-function technique to find 
the distribution of the sum of independent random variables. 

Theorem 9 If $X_1, \ldots, X_n$ are independent random variables and the
moment generating function of each exists for all $-h < t < h$ for some
$h > 0$, let $Y = \sum_{1}^{n} X_i$; then

$$m_Y(t) = \mathcal{E}\left[\exp \sum X_i t\right] = \prod_{i=1}^{n} m_{X_i}(t) \quad \text{for } -h < t < h.$$

PROOF

$$m_Y(t) = \mathcal{E}\left[\exp \sum X_i t\right] = \mathcal{E}\left[\prod_{i=1}^{n} e^{X_i t}\right] = \prod_{i=1}^{n} \mathcal{E}[e^{X_i t}] = \prod_{i=1}^{n} m_{X_i}(t)$$

using Theorem 9 of Chap. IV. ////

The power and utility of Theorem 9 become apparent if we recall Theorem
7 of Chap. II, which says that a moment generating function, when it exists,
determines the distribution function. Thus, if we can recognize $\prod_{i=1}^{n} m_{X_i}(t)$ as
the moment generating function corresponding to a particular distribution,
then we have found the distribution of $\sum_{1}^{n} X_i$. In the following examples, we
will be able to do just that.


EXAMPLE 9 Suppose that $X_1, \ldots, X_n$ are independent Bernoulli random
variables; that is, $P[X_i = 1] = p$, and $P[X_i = 0] = 1 - p = q$. Now

$$m_{X_i}(t) = pe^t + q.$$

So

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = (pe^t + q)^n,$$

the moment generating function of a binomial random variable; hence
$\sum_{1}^{n} X_i$ has a binomial distribution with parameters n and p. ////
EXAMPLE 10 Suppose that $X_1, \ldots, X_n$ are independent Poisson distributed
random variables, $X_i$ having parameter $\lambda_i$. Then

$$m_{X_i}(t) = \mathcal{E}[e^{tX_i}] = \exp \lambda_i(e^t - 1),$$

and hence

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \prod_{i=1}^{n} \exp \lambda_i(e^t - 1) = \exp \sum \lambda_i (e^t - 1),$$

which is again the moment generating function of a Poisson distributed
random variable having parameter $\sum \lambda_i$. So the distribution of a sum of
independent Poisson distributed random variables is again a Poisson
distributed random variable with a parameter equal to the sum of the
individual parameters. ////
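
The conclusion of Example 10 can be confirmed without moment generating functions by convolving two Poisson densities directly. The sketch below is illustrative: the parameter values 1.5 and 2.5 and the helper names are assumptions, and the comparison is made for the first 21 mass points only.

```python
import math

def poisson_pmf(k, lam):
    """Discrete density of a Poisson distribution with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def sum_pmf(k, lam1, lam2):
    """P[X1 + X2 = k] for independent Poisson X1, X2, by direct summation."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))

lam1, lam2 = 1.5, 2.5
# Largest discrepancy from the Poisson density with parameter lam1 + lam2.
max_gap = max(abs(sum_pmf(k, lam1, lam2) - poisson_pmf(k, lam1 + lam2))
              for k in range(21))
print(max_gap)  # differs from zero only by floating-point rounding
```

The direct sum works here because the binomial theorem collapses it; the moment-generating-function argument reaches the same result without that algebra.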


EXAMPLE 11 Assume that $X_1, \ldots, X_n$ are independent and identically
distributed exponential random variables; then

$$m_{X_i}(t) = \frac{\lambda}{\lambda - t}.$$

So

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^n,$$

which is the moment generating function of a gamma distribution with
parameters n and $\lambda$; hence,

$$f_{\sum X_i}(x) = \frac{\lambda^n}{\Gamma(n)} x^{n-1} e^{-\lambda x} I_{(0,\infty)}(x),$$

the density of a gamma distribution with parameters n and $\lambda$. ////


EXAMPLE 12 Assume that $X_1, \ldots, X_n$ are independent random variables
and

$$X_i \sim N(\mu_i, \sigma_i^2);$$

then

$$a_i X_i \sim N(a_i \mu_i, a_i^2 \sigma_i^2),$$

and

$$m_{a_i X_i}(t) = \exp(a_i \mu_i t + \tfrac{1}{2} a_i^2 \sigma_i^2 t^2).$$

Hence

$$m_{\sum a_i X_i}(t) = \prod_{i=1}^{n} m_{a_i X_i}(t) = \exp\left[\left(\sum a_i \mu_i\right)t + \tfrac{1}{2}\left(\sum a_i^2 \sigma_i^2\right)t^2\right],$$

which is the moment generating function of a normal random variable; so

$$\sum_{1}^{n} a_i X_i \sim N\left(\sum_{1}^{n} a_i \mu_i, \sum_{1}^{n} a_i^2 \sigma_i^2\right).$$

The above says that any linear combination (that is, $\sum a_i X_i$) of independent
normal random variables is itself a normally distributed random
variable. (Actually, any linear combination of jointly normally distributed
random variables is normally distributed. Independence is not
required.) In particular, if

$$X \sim N(\mu_X, \sigma_X^2), \quad Y \sim N(\mu_Y, \sigma_Y^2),$$

and X and Y are independent, then

$$X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2),$$

and

$$X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2).$$

If $X_1, \ldots, X_n$ are independent and identically distributed random variables
distributed $N(\mu, \sigma^2)$, then

$$\bar{X}_n = \frac{1}{n}\sum_{1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right);$$

that is, the sample mean has a (not approximate) normal distribution. ////

In the above examples we found the exact distribution of the sums of
certain independent random variables. Other examples, including the important
result that the sum of independent identically distributed geometric random
variables has a negative binomial distribution, are given in the Problems. One
is often more interested in the average, that is, $(1/n)\sum_{1}^{n} X_i$, than in the sum.
Note, however, that if the distribution of the sum is known, then the distribution
of the average is readily derivable since

$$F_{(1/n)\sum X_i}(z) = P\left[\frac{1}{n}\sum X_i \le z\right] = P\left[\sum X_i \le nz\right] = F_{\sum X_i}(nz). \tag{30}$$

In Examples 9 to 12 above, where we derived the distribution of a sum, we have
in essence also derived the distribution of the corresponding average. One of
the most important theorems of all probability theory, the central-limit theorem,
gives an approximate distribution of an average. We will state this theorem
next and then again in our discussion of sampling in Chap. VI, where we will
outline its proof.

Theorem 10 Central-limit theorem If for each positive integer $n$,
$X_1, \ldots, X_n$ are independent and identically distributed random variables
with mean $\mu_X$ and variance $\sigma_X^2$, then for each $z$

$$F_{Z_n}(z) \text{ converges to } \Phi(z) \text{ as } n \text{ approaches } \infty, \qquad (31)$$

where

$$Z_n = \frac{\bar{X}_n - \mathcal{E}[\bar{X}_n]}{\sqrt{\operatorname{var}[\bar{X}_n]}} = \frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}}. \qquad ////$$

We have made use of Eq. (9), which stated that $\mathcal{E}[\bar{X}_n] = \mu_X$ and
$\operatorname{var}[\bar{X}_n] = \sigma_X^2/n$. Equation (31) states that for each fixed
argument $z$ the value of the cumulative distribution function of $Z_n$, for
$n = 1, 2, \ldots$, converges to the value $\Phi(z)$. [Recall that $\Phi(\cdot)$ is the
cumulative distribution function of the standard normal distribution.]

Note what the central-limit theorem says: If you have independent random
variables $X_1, \ldots, X_n, \ldots$, each with the same distribution which has a mean and
variance, then $\bar{X}_n = (1/n)\sum X_i$ "standardized" by subtracting its mean and
then dividing by its standard deviation has a distribution that approaches a
standard normal distribution. The key thing to note is that it does not make
any difference what common distribution the $X_1, \ldots, X_n, \ldots$ have, as long as
they have a mean and variance. A number of useful approximations can be
garnered from the central-limit theorem, and they are listed as a corollary.

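Corollary If $X_1, \ldots, X_n$ are independent and identically distributed
random variables with common mean $\mu_X$ and variance $\sigma_X^2$, then

$$P\left[a < \frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}} < b\right] \approx \Phi(b) - \Phi(a), \qquad (32)$$

$$P[c < \bar{X}_n < d] \approx \Phi\left(\frac{d - \mu_X}{\sigma_X/\sqrt{n}}\right) - \Phi\left(\frac{c - \mu_X}{\sigma_X/\sqrt{n}}\right), \qquad (33)$$

$$P\left[r < \sum_{i=1}^{n} X_i < s\right] \approx \Phi\left(\frac{s - n\mu_X}{\sqrt{n}\,\sigma_X}\right) - \Phi\left(\frac{r - n\mu_X}{\sqrt{n}\,\sigma_X}\right). \qquad (34)$$

////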
Equations (32) to (34) give approximate values for the probabilities of
certain events described in terms of averages or sums. The practical utility 
of the central-limit theorem is inherent in these approximations. 
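
The normal approximation for the probability that an average falls in an
interval can be sketched as follows. This is an illustration only: the helper
names, the choice of exponential summands, the sample size, and the seed are
all assumptions made here. The standard normal cumulative distribution
function is computed from `math.erf`.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_prob_mean_between(c, d, mu, sigma, n):
    """Central-limit approximation to P[c < sample mean < d]."""
    se = sigma / math.sqrt(n)
    return phi((d - mu) / se) - phi((c - mu) / se)


def simulated_prob_mean_between(c, d, n, rate=1.0, reps=20_000, seed=1):
    """Monte Carlo estimate of the same probability for exponential(rate) summands."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(rate) for _ in range(n)) / n
        if c < xbar < d:
            hits += 1
    return hits / reps
```

With exponential summands of mean 1 and variance 1 and $n = 100$, the
approximation `clt_prob_mean_between(0.9, 1.1, 1.0, 1.0, 100)` is
$\Phi(1) - \Phi(-1)$, and the simulated frequency should agree with it to
within sampling error even though the summands are far from normal.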

At this stage we can conveniently discuss and contrast two terms that are a
vital part of a statistician’s vocabulary. These two terms are limiting distribution
and asymptotic distribution. A distribution is called a limiting distribution
function if it is the limit distribution function of a sequence of distribution
functions. Equation (31) provides us with an example; $\Phi(z)$ is the limiting
distribution function of the sequence of distribution functions $F_{Z_1}(\cdot), F_{Z_2}(\cdot), \ldots,
F_{Z_n}(\cdot), \ldots$. Also $\Phi(z)$ is called the limiting distribution of the sequence of random
variables $Z_1, Z_2, \ldots, Z_n, \ldots$. On the other hand, an asymptotic distribution
of a random variable, say $Y_n$, in a sequence of random variables $Y_1, Y_2, \ldots, Y_n, \ldots$
is any distribution that is approximately equal to the actual distribution of $Y_n$
for large $n$. As an example [see Eq. (33)], we say that $\bar{X}_n$ has an asymptotic
distribution that is a normal distribution with mean $\mu_X$ and variance $\sigma_X^2/n$. Note
that an asymptotic distribution may depend on $n$ whereas a limiting distribution
does not (for a limiting distribution the dependence on $n$ was removed in taking
the limit). Yet the two terms are closely related since it was precisely the fact
that the sequence $Z_1, Z_2, \ldots, Z_n, \ldots$ had a limiting standard normal distribution
that allowed us to say that $\bar{X}_n$ had an asymptotic normal distribution with mean
$\mu_X$ and variance $\sigma_X^2/n$. The idea is that if the distribution of $Z_n$ is converging
to $\Phi(z)$, then for large $n$ $Z_n$ must be approximately distributed
$N(0, 1)$. But if $Z_n = (\bar{X}_n - \mu_X)/(\sigma_X/\sqrt{n})$ is approximately distributed $N(0, 1)$,
then $\bar{X}_n$ is approximately distributed $N(\mu_X, \sigma_X^2/n)$.

In concluding this section we give two further examples concerning sums. 
The first shows how expressing one random variable as a sum of other simpler 
random variables is often a useful ploy. The second shows how the distribution 
of a sum can be obtained even though the number of terms in the sum is also a 
random variable, something that occasionally occurs in practice. 


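EXAMPLE 13 Consider $n$ repeated independent trials, each of which has
possible outcomes $o_1, \ldots, o_{k+1}$. Let $p_j$ denote the probability of outcome
$o_j$ on a particular trial, and let $X_j$ denote the number of the $n$ trials resulting
in outcome $o_j$, $j = 1, \ldots, k+1$. We saw that $(X_1, \ldots, X_k)$ had a multinomial
distribution. Now let

$$Z_{j\alpha} = \begin{cases} 1 & \text{if the } \alpha\text{th trial results in outcome } o_j \\ 0 & \text{otherwise;} \end{cases}$$

then $X_j = \sum_{\alpha=1}^{n} Z_{j\alpha}$. Now suppose we want to find $\operatorname{cov}[X_i, X_j]$
for $i \ne j$. Intuitively, we might suspect that such covariance is negative since
when one of the random variables is large another tends to be small.

$$\operatorname{cov}[X_i, X_j] = \operatorname{cov}\left[\sum_{\beta=1}^{n} Z_{i\beta}, \sum_{\alpha=1}^{n} Z_{j\alpha}\right] = \sum_{\beta=1}^{n} \sum_{\alpha=1}^{n} \operatorname{cov}[Z_{i\beta}, Z_{j\alpha}]$$

by Theorem 2. Now if $\alpha \ne \beta$, then $Z_{i\beta}$ and $Z_{j\alpha}$ are independent since they
correspond to different trials, which are independent. Hence

$$\sum_{\beta=1}^{n} \sum_{\alpha=1}^{n} \operatorname{cov}[Z_{i\beta}, Z_{j\alpha}] = \sum_{\alpha=1}^{n} \operatorname{cov}[Z_{i\alpha}, Z_{j\alpha}].$$

But $\operatorname{cov}[Z_{i\alpha}, Z_{j\alpha}] = \mathcal{E}[Z_{i\alpha} Z_{j\alpha}] - \mathcal{E}[Z_{i\alpha}]\mathcal{E}[Z_{j\alpha}]$, and $\mathcal{E}[Z_{i\alpha} Z_{j\alpha}] = 0$ since
at least one of $Z_{i\alpha}$ and $Z_{j\alpha}$ must be 0. Now $\mathcal{E}[Z_{i\alpha}] = p_i$, and $\mathcal{E}[Z_{j\alpha}] = p_j$,
so $\operatorname{cov}[X_i, X_j] = -n p_i p_j$. ////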
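
The covariance $-n p_i p_j$ can be confirmed exactly for a small case by brute
force. The sketch below is an illustration under assumed values (the function
name, $n = 4$, and the three outcome probabilities are hypothetical): it
enumerates every sequence of trial outcomes, weights it by its probability,
and computes the covariance of two of the counts with exact fractions.

```python
from fractions import Fraction
from itertools import product


def multinomial_cov(n, probs, i, j):
    """Exact cov[X_i, X_j] by enumerating every sequence of n trial outcomes."""
    e_i = e_j = e_ij = Fraction(0)
    for seq in product(range(len(probs)), repeat=n):
        weight = Fraction(1)
        for outcome in seq:
            weight *= probs[outcome]  # trials are independent
        x_i, x_j = seq.count(i), seq.count(j)
        e_i += weight * x_i
        e_j += weight * x_j
        e_ij += weight * x_i * x_j
    return e_ij - e_i * e_j


# Hypothetical outcome probabilities for three outcomes.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
```

With these values `multinomial_cov(4, probs, 0, 1)` should equal
$-4 \cdot \tfrac{1}{2} \cdot \tfrac{1}{3} = -\tfrac{2}{3}$.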


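EXAMPLE 14 Let $X_1, \ldots, X_n, \ldots$ be a sequence of independent and
identically distributed random variables with mean $\mu_X$ and variance $\sigma_X^2$. Let $N$
be an integer-valued random variable, and define $S_N = \sum_{i=1}^{N} X_i$; that is, $S_N$
is the sum of the first $N$ $X_i$'s, where $N$ is a random variable as are the
$X_i$'s. Thus $S_N$ is a sum of a random number of random variables. Let us
assume that $N$ is independent of the $X_i$'s. Then $\mathcal{E}[S_N] = \mathcal{E}[\mathcal{E}[S_N \mid N]]$
by Eq. (26) of Chap. IV. But $\mathcal{E}[S_N \mid N = n] = \mathcal{E}[X_1 + \cdots + X_n] = n\mu_X$;
so $\mathcal{E}[S_N \mid N] = N\mu_X$, and $\mathcal{E}[\mathcal{E}[S_N \mid N]] = \mathcal{E}[N\mu_X] = \mu_X \mathcal{E}[N] = \mu_N \mu_X$.
Similarly, using Theorem 7 of Chap. IV,

$$\begin{aligned}
\operatorname{var}[S_N] &= \mathcal{E}[\operatorname{var}[S_N \mid N]] + \operatorname{var}[\mathcal{E}[S_N \mid N]] \\
&= \mathcal{E}[N\sigma_X^2] + \operatorname{var}[N\mu_X] \\
&= \sigma_X^2\, \mathcal{E}[N] + \mu_X^2 \operatorname{var}[N] \\
&= \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2.
\end{aligned}$$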

Suppose now that $N$ has a geometric distribution [see Eq. (14) of Chap. III]
with parameter $p$, $X_i$ has an exponential distribution with parameter $\lambda$,
and we are interested in the distribution of $S_N$. Further assume
independence of $N$ and the $X_i$'s. Now, for $z > 0$,

$$\begin{aligned}
P[S_N \le z] &= \sum_{n=1}^{\infty} P[S_N \le z \mid N = n]\, P[N = n] \\
&= \sum_{n=1}^{\infty} \left[\int_0^z \frac{\lambda^n}{\Gamma(n)} u^{n-1} e^{-\lambda u}\, du\right] p(1-p)^{n-1}
\end{aligned}$$

(by using the fact that a sum of independent and identically distributed
exponential random variables has a gamma distribution)

$$\begin{aligned}
&= \int_0^z \sum_{n=1}^{\infty} \lambda p\, e^{-\lambda u}\, \frac{[\lambda u (1-p)]^{n-1}}{(n-1)!}\, du \\
&= \lambda p \int_0^z e^{-\lambda u} e^{\lambda(1-p)u}\, du = \lambda p \int_0^z e^{-\lambda p u}\, du = 1 - e^{-\lambda p z}.
\end{aligned}$$

That is, $S_N$ has an exponential distribution with parameter $p\lambda$. Recall
that [see Eq. (14) of Chap. III] $\mathcal{E}[N] = 1/p$ and $\operatorname{var}[N] = (1-p)/p^2$; also,
$\mathcal{E}[X] = 1/\lambda$, and $\operatorname{var}[X] = 1/\lambda^2$. So, as a check of the formulas for the
mean and variance derived above, note that

$$\mathcal{E}[S_N] = \mu_N \mu_X = \frac{1}{p} \cdot \frac{1}{\lambda} = \frac{1}{p\lambda}$$

and

$$\operatorname{var}[S_N] = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2 = \frac{1}{p} \cdot \frac{1}{\lambda^2} + \frac{1-p}{p^2} \cdot \frac{1}{\lambda^2} = \frac{1}{(p\lambda)^2},$$

which are the mean and variance, respectively, of an exponential distribution
with parameter $p\lambda$. ////
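
The moment formulas for a random sum, and the claim that a geometric number
of exponential terms is again exponential, can be checked with a short
simulation. The sketch below is illustrative: the function names, the seed,
and the values of $p$ and $\lambda$ are assumptions, and the geometric
variable is taken on $1, 2, \ldots$ so that $\mathcal{E}[N] = 1/p$ as in the
check above.

```python
import math
import random


def random_sum_moments(mean_n, var_n, mean_x, var_x):
    """Mean and variance of S_N = X_1 + ... + X_N when N is independent of the X_i."""
    mean_s = mean_n * mean_x
    var_s = mean_n * var_x + var_n * mean_x ** 2
    return mean_s, var_s


def draw_geometric_sum(rng, p, lam):
    """One draw of S_N with P[N = n] = p(1 - p)^(n - 1), n = 1, 2, ..., X_i ~ exponential(lam)."""
    n = 1
    while rng.random() >= p:  # each failure adds one more term to the sum
        n += 1
    return sum(rng.expovariate(lam) for _ in range(n))


def empirical_cdf_of_sum(z, p, lam, reps=40_000, seed=4):
    """Fraction of simulated S_N values that are at most z."""
    rng = random.Random(seed)
    return sum(draw_geometric_sum(rng, p, lam) <= z for _ in range(reps)) / reps
```

For $p = 0.25$ and $\lambda = 2$ the formulas give mean $1/(p\lambda) = 2$ and
variance $1/(p\lambda)^2 = 4$, and `empirical_cdf_of_sum(1.0, 0.25, 2.0)`
should be close to $1 - e^{-0.5}$.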


5 THE TRANSFORMATION $Y = g(X)$

The last of our three techniques for finding the distribution of functions of
given random variables is the transformation technique. It is discussed in this
section for the special case of finding the distribution of a function of a
unidimensional random variable. That is, for a given random variable $X$ we seek
the distribution of $Y = g(X)$ for some function $g(\cdot)$. Discussion of the general
case is deferred until Sec. 6 below. Both the notation $Y = g(X)$ and the notation
$y = g(x)$ will appear in the ensuing paragraphs; $y = g(x)$ is the usual notation
for the function or transformation specified by $g(\cdot)$, and $Y = g(X)$ defines the
random variable $Y$ as the function $g(\cdot)$ of the random variable $X$.

5.1 Distribution of $Y = g(X)$

A random variable $X$ may be transformed by some function $g(\cdot)$ to define a
new random variable $Y$. The density of $Y$, $f_Y(y)$, will be determined by the
transformation $g(\cdot)$ together with the density $f_X(x)$ of $X$.
First, if $X$ is a discrete random variable with mass points $x_1, x_2, \ldots$,
then the distribution of $Y = g(X)$ is determined directly by the laws of
probability. If $X$ takes on the values $x_1, x_2, \ldots$ with probabilities $f_X(x_1), f_X(x_2), \ldots$,
then the possible values of $Y$ are determined by substituting the successive
values of $X$ in $g(\cdot)$. It may be that several values of $X$ give rise to the same
value of $Y$. The probability that $Y$ takes on a given value, say $y_j$, is

$$f_Y(y_j) = \sum_{\{i:\, g(x_i) = y_j\}} f_X(x_i). \qquad (35)$$


EXAMPLE 15 Suppose $X$ takes on the values 0, 1, 2, 3, 4, 5 with
probabilities $f_X(0), f_X(1), f_X(2), f_X(3), f_X(4)$, and $f_X(5)$. If $Y = g(X) = (X - 2)^2$,
note that $Y$ can take on values 0, 1, 4, and 9; then $f_Y(0) = f_X(2)$,
$f_Y(1) = f_X(1) + f_X(3)$, $f_Y(4) = f_X(0) + f_X(4)$, and $f_Y(9) = f_X(5)$. ////
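
The bookkeeping in the discrete case, adding up the probabilities of all
mass points that map to the same value, is easy to mechanize. The following
is a small sketch; the function name and the uniform probabilities assigned
to the mass points are assumptions for illustration, since the text leaves
the probabilities general.

```python
from collections import defaultdict
from fractions import Fraction


def transform_pmf(pmf, g):
    """Discrete density of Y = g(X): add f_X(x) over all mass points x with g(x) = y."""
    out = defaultdict(Fraction)
    for x, prob in pmf.items():
        out[g(x)] += prob
    return dict(out)


# Hypothetical probabilities for the mass points 0, 1, ..., 5.
f_x = {x: Fraction(1, 6) for x in range(6)}
f_y = transform_pmf(f_x, lambda x: (x - 2) ** 2)
```

Here `f_y` has mass points 0, 1, 4, and 9, with the values 1 and 4 each
collecting the probability of two mass points of $X$.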

Second, if $X$ is a continuous random variable, then the cumulative
distribution function of $Y = g(X)$ can be found by integrating $f_X(x)$ over the
appropriate region; that is,

$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = \int_{\{x:\, g(x) \le y\}} f_X(x)\, dx. \qquad (36)$$

This is just the cumulative-distribution-function technique. 


EXAMPLE 16 Let $X$ be a random variable with uniform distribution over the
interval (0, 1) and let $Y = g(X) = X^2$. The density of $Y$ is desired.
Now

$$F_Y(y) = P[Y \le y] = P[X^2 \le y] = \int_{\{x:\, x^2 \le y\}} f_X(x)\, dx = \int_0^{\sqrt{y}} dx = \sqrt{y}$$

for $0 < y < 1$; so

$$F_Y(y) = \sqrt{y}\, I_{(0,1)}(y) + I_{[1,\infty)}(y),$$

and therefore

$$f_Y(y) = \frac{1}{2\sqrt{y}}\, I_{(0,1)}(y). \qquad ////$$

Application of the cumulative-distribution-function technique to find the 
density of Y = g(X), as in the above example, produces the transformation 
technique, the result of which is given in the following theorem. 
Theorem 11 Suppose $X$ is a continuous random variable with
probability density function $f_X(\cdot)$. Set $\mathfrak{X} = \{x: f_X(x) > 0\}$. Assume that:

(i) $y = g(x)$ defines a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y}$.

(ii) The derivative of $x = g^{-1}(y)$ with respect to $y$ is continuous and
nonzero for $y \in \mathfrak{Y}$, where $g^{-1}(y)$ is the inverse function of $g(x)$; that is,
$g^{-1}(y)$ is that $x$ for which $g(x) = y$.

Then $Y = g(X)$ is a continuous random variable with density

$$f_Y(y) = \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y).$$


PROOF The above is a standard theorem from calculus on the change
of variable in a definite integral; so we will only sketch the proof. Consider
the case when $\mathfrak{X}$ is an interval. Let us suppose that $g(x)$ is a monotone
increasing function over $\mathfrak{X}$; that is, $g'(x) > 0$, which is true if and only if
$(d/dy)g^{-1}(y) > 0$ over $\mathfrak{Y}$. For $y \in \mathfrak{Y}$, $F_Y(y) = P[g(X) \le y] = P[X \le
g^{-1}(y)] = F_X(g^{-1}(y))$, and hence $f_Y(y) = (d/dy)F_Y(y) = [(d/dy)g^{-1}(y)]
f_X(g^{-1}(y))$ by the chain rule of differentiation. On the other hand, if $g(x)$
is a monotone decreasing function over $\mathfrak{X}$, so that $g'(x) < 0$ and
$(d/dy)g^{-1}(y) < 0$, then $F_Y(y) = P[g(X) \le y] = P[X \ge g^{-1}(y)] = 1 - F_X(g^{-1}(y))$,
and therefore $f_Y(y) = -[(d/dy)g^{-1}(y)] f_X(g^{-1}(y)) = |(d/dy)g^{-1}(y)|
f_X(g^{-1}(y))$ for $y \in \mathfrak{Y}$. ////


EXAMPLE 17 Suppose $X$ has a beta distribution. What is the distribution of
$Y = -\log_e X$? $\mathfrak{X} = \{x: f_X(x) > 0\} = \{x: 0 < x < 1\}$. $y = g(x) = -\log_e x$
defines a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y} = \{y: y > 0\}$. $x =
g^{-1}(y) = e^{-y}$, so $(d/dy)g^{-1}(y) = -e^{-y}$, which is continuous and nonzero
for $y \in \mathfrak{Y}$. By Theorem 11,

$$\begin{aligned}
f_Y(y) &= \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y) \\
&= e^{-y}\, \frac{1}{B(a, b)}\, (e^{-y})^{a-1} (1 - e^{-y})^{b-1} I_{(0,\infty)}(y) \\
&= \frac{1}{B(a, b)}\, e^{-ay} (1 - e^{-y})^{b-1} I_{(0,\infty)}(y).
\end{aligned}$$

In particular, if $b = 1$, then $B(a, b) = 1/a$; so $f_Y(y) = a e^{-ay} I_{(0,\infty)}(y)$, an
exponential distribution with parameter $a$. ////
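
The one-to-one change-of-variable formula translates directly into code: the
density of $Y$ is the absolute derivative of the inverse map times the
density of $X$ evaluated at the inverse. The sketch below is illustrative;
the helper names and the choice $a = 3$, $b = 1$ are assumptions. It applies
the formula to $Y = -\log_e X$ with $X$ beta distributed, so that the result
can be compared with the exponential density $a e^{-ay}$.

```python
import math


def transformed_density(f_x, g_inv, dg_inv):
    """Density of Y = g(X) for one-to-one g: |d/dy g^{-1}(y)| * f_X(g^{-1}(y))."""
    def f_y(y):
        return abs(dg_inv(y)) * f_x(g_inv(y))
    return f_y


def beta_density(a, b):
    """Beta(a, b) density, zero outside (0, 1)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

    def f(x):
        return const * x ** (a - 1) * (1 - x) ** (b - 1) if 0 < x < 1 else 0.0
    return f


a = 3.0  # hypothetical first beta parameter; the second is fixed at b = 1
f_y = transformed_density(
    beta_density(a, 1.0),
    lambda y: math.exp(-y),    # x = g^{-1}(y) = e^{-y}
    lambda y: -math.exp(-y),   # derivative of g^{-1}
)
```

Because the beta density vanishes outside (0, 1), `f_y` is automatically zero
for $y \le 0$, which plays the role of the indicator function of
$\mathfrak{Y}$.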
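EXAMPLE 18 Suppose $X$ has the Pareto density $f_X(x) = \theta x^{-\theta-1} I_{[1,\infty)}(x)$ and
the distribution of $Y = \log_e X$ is desired.

$$\begin{aligned}
f_Y(y) &= \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y) \\
&= e^{y}\, \theta (e^{y})^{-\theta-1} I_{[1,\infty)}(e^{y}) = \theta e^{-\theta y} I_{[0,\infty)}(y). \qquad ////
\end{aligned}$$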


The condition that $g(x)$ be a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y}$ is
unnecessarily restrictive. For the transformation $y = g(x)$, each point in $\mathfrak{X}$
will correspond to just one point in $\mathfrak{Y}$; but to a point in $\mathfrak{Y}$ there may correspond
more than one point in $\mathfrak{X}$, which says that the transformation is not one-to-one,
and consequently Theorem 11 is not directly applicable. If, however, $\mathfrak{X}$ can be
decomposed into a finite (or even countable) number of disjoint sets, say $\mathfrak{X}_1, \ldots,
\mathfrak{X}_m$, so that $y = g(x)$ defines a one-to-one transformation of each $\mathfrak{X}_i$ into $\mathfrak{Y}$,
then the density of $Y = g(X)$ can be found. Let $x = g_i^{-1}(y)$ denote the
inverse of $y = g(x)$ for $x \in \mathfrak{X}_i$. Then the density of $Y = g(X)$ is given by

$$f_Y(y) = \sum \left|\frac{d}{dy} g_i^{-1}(y)\right| f_X(g_i^{-1}(y))\, I_{\mathfrak{Y}}(y), \qquad (37)$$

where the summation is over those values of $i$ for which $g(x) = y$ for some value
of $x$ in $\mathfrak{X}_i$.


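EXAMPLE 19 Let $X$ be a continuous random variable with density $f_X(\cdot)$,
and let $Y = g(X) = X^2$. Note that if $\mathfrak{X}$ is an interval containing both
negative and positive points, then $y = g(x) = x^2$ is not one-to-one. However,
if $\mathfrak{X}$ is decomposed into $\mathfrak{X}_1 = \{x: x \in \mathfrak{X}, x < 0\}$ and $\mathfrak{X}_2 = \{x: x \in \mathfrak{X},
x \ge 0\}$, then $y = g(x)$ defines a one-to-one transformation on each $\mathfrak{X}_i$.
Note that $g_1^{-1}(y) = -\sqrt{y}$ and $g_2^{-1}(y) = \sqrt{y}$. By Eq. (37),

$$f_Y(y) = \left[\frac{1}{2\sqrt{y}} f_X(-\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(\sqrt{y})\right] I_{(0,\infty)}(y).$$

In particular, if

$$f_X(x) = \tfrac{1}{2} e^{-|x|},$$

then

$$f_Y(y) = \frac{1}{2\sqrt{y}}\, e^{-\sqrt{y}}\, I_{(0,\infty)}(y);$$

or, if

$$f_X(x) = \tfrac{2}{9}(x + 1)\, I_{(-1,2)}(x),$$

then

$$f_Y(y) = \left[\frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(-\sqrt{y} + 1) + \frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(1 + \sqrt{y})\right] I_{(0,1)}(y) + \frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(1 + \sqrt{y})\, I_{[1,4)}(y).$$

////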
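
When the transformation is $y = x^2$, the two branches $-\sqrt{y}$ and
$\sqrt{y}$ each contribute a term to the density of $Y$. The sketch below is
an illustration with assumed names; it implements the two-branch sum and
evaluates it for a double exponential density and for the ramp density
$\tfrac{2}{9}(x + 1)$ on $(-1, 2)$.

```python
import math


def density_of_square(f_x):
    """Density of Y = X^2: sum the contributions of the branches -sqrt(y) and +sqrt(y)."""
    def f_y(y):
        if y <= 0:
            return 0.0
        root = math.sqrt(y)
        # Both branches have |d/dy g_i^{-1}(y)| = 1 / (2 sqrt(y)).
        return (f_x(-root) + f_x(root)) / (2.0 * root)
    return f_y


def double_exponential(x):
    return 0.5 * math.exp(-abs(x))


def ramp(x):
    return (2.0 / 9.0) * (x + 1.0) if -1.0 < x < 2.0 else 0.0


f_y_double_exponential = density_of_square(double_exponential)
f_y_ramp = density_of_square(ramp)
```

For the ramp density only the positive branch contributes once $y \ge 1$,
because $-\sqrt{y}$ then falls outside $(-1, 2)$; the zero returned by `ramp`
handles this without any explicit case split.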


5.2 Probability Integral Transform 

If $X$ is a random variable with cumulative distribution $F_X(\cdot)$, then $F_X(\cdot)$ is a
candidate for $g(\cdot)$ in the transformation $Y = g(X)$. The following theorem
gives the distribution of $Y = F_X(X)$ if $F_X(\cdot)$ is continuous. Since $F_X(\cdot)$ is a
nondecreasing function, the inverse function $F_X^{-1}(\cdot)$ may be defined for any
value of $y$ between 0 and 1 as: $F_X^{-1}(y)$ is the smallest $x$ satisfying $F_X(x) \ge y$.

Theorem 12 If $X$ is a random variable with continuous cumulative
distribution function $F_X(x)$, then $U = F_X(X)$ is uniformly distributed over the
interval (0, 1). Conversely, if $U$ is uniformly distributed over the interval
(0, 1), then $X = F_X^{-1}(U)$ has cumulative distribution function $F_X(\cdot)$.

PROOF $P[U \le u] = P[F_X(X) \le u] = P[X \le F_X^{-1}(u)] = F_X(F_X^{-1}(u)) =
u$ for $0 < u < 1$. Conversely, $P[X \le x] = P[F_X^{-1}(U) \le x] = P[U \le F_X(x)]
= F_X(x)$. ////

In various statistical applications, particularly in simulation studies, it is
often desired to generate values of some random variable $X$. To generate a value
of a random variable $X$ having continuous cumulative distribution function
$F_X(\cdot)$, it suffices to generate a value of a random variable $U$ that is uniformly
distributed over the interval (0, 1). This follows from Theorem 12 since if $U$ is a
random variable with a uniform distribution over the interval (0, 1), then $X =
F_X^{-1}(U)$ is a random variable having distribution $F_X(\cdot)$. So to get a value,
say $x$, of a random variable $X$, obtain a value, say $u$, of a random variable $U$,
compute $F_X^{-1}(u)$, and set it equal to $x$. A value $u$ of a random variable $U$ is called
a random number. Many computer-oriented random-number generators are
available.
EXAMPLE 20 $F_X(x) = (1 - e^{-\lambda x}) I_{(0,\infty)}(x)$. $F_X^{-1}(y) = -(1/\lambda)\log_e(1 - y)$; so
$-(1/\lambda)\log_e(1 - U)$ is a random variable having distribution $(1 - e^{-\lambda x})
I_{(0,\infty)}(x)$ if $U$ is a random variable uniformly distributed over the interval
(0, 1). ////
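
Generating exponential values from uniform random numbers takes only a few
lines. The following is a minimal sketch using Python's standard `random`
module as the source of uniform random numbers; the function names, the seed,
and the sample size are assumptions made for illustration.

```python
import math
import random


def exponential_inverse_cdf(u, lam):
    """F_X^{-1}(u) = -(1/lam) * log(1 - u) for the exponential distribution."""
    return -math.log(1.0 - u) / lam


def sample_exponential(lam, size, seed=0):
    """Turn uniform random numbers on [0, 1) into exponential(lam) values."""
    rng = random.Random(seed)
    return [exponential_inverse_cdf(rng.random(), lam) for _ in range(size)]
```

With $\lambda = 2$, the sample mean of `sample_exponential(2.0, 50_000)`
should be near $1/\lambda = 0.5$, and the fraction of values not exceeding
0.5 should be near $1 - e^{-1}$.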


The transformation $Y = F_X(X)$ is called the probability integral
transformation. It plays an important role in the theory of distribution-free statistics
and goodness-of-fit tests.


6 TRANSFORMATIONS 

In Sec. 5 we considered the problem of obtaining the distribution of a function 
of a given random variable. It is natural to consider next the problem of 
obtaining the joint distribution of several random variables which are functions 
of a given set of random variables. 


6.1 Discrete Random Variables 

Suppose that the discrete density function $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$ of the
$n$-dimensional discrete random variable $(X_1, \ldots, X_n)$ is given. Let $\mathfrak{X}$ denote
the mass points of $(X_1, \ldots, X_n)$; that is,

$$\mathfrak{X} = \{(x_1, \ldots, x_n): f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0\}.$$

Suppose that the joint density of $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_k = g_k(X_1, \ldots, X_n)$
is desired. It can be observed that $Y_1, \ldots, Y_k$ are jointly discrete and
$P[Y_1 = y_1; \ldots; Y_k = y_k] = f_{Y_1, \ldots, Y_k}(y_1, \ldots, y_k) = \sum f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$, where
the summation is over those $(x_1, \ldots, x_n)$ belonging to $\mathfrak{X}$ for which $(y_1, \ldots, y_k) =
(g_1(x_1, \ldots, x_n), \ldots, g_k(x_1, \ldots, x_n))$.


EXAMPLE 21 Let $(X_1, X_2, X_3)$ have a joint discrete density function given
by

$$\begin{array}{c|cccccc}
(x_1, x_2, x_3) & (0,0,0) & (0,0,1) & (0,1,1) & (1,0,1) & (1,1,0) & (1,1,1) \\ \hline
f_{X_1,X_2,X_3}(x_1, x_2, x_3) & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8}
\end{array}$$

Find the joint density of $Y_1 = g_1(X_1, X_2, X_3) = X_1 + X_2 + X_3$ and
$Y_2 = g_2(X_1, X_2, X_3) = |X_3 - X_2|$.

$\mathfrak{X} = \{(0,0,0), (0,0,1), (0,1,1), (1,0,1), (1,1,0), (1,1,1)\}$.

$$\begin{aligned}
f_{Y_1,Y_2}(0, 0) &= f_{X_1,X_2,X_3}(0, 0, 0) = \tfrac{1}{8}, \\
f_{Y_1,Y_2}(1, 1) &= f_{X_1,X_2,X_3}(0, 0, 1) = \tfrac{3}{8}, \\
f_{Y_1,Y_2}(2, 0) &= f_{X_1,X_2,X_3}(0, 1, 1) = \tfrac{1}{8}, \\
f_{Y_1,Y_2}(2, 1) &= f_{X_1,X_2,X_3}(1, 0, 1) + f_{X_1,X_2,X_3}(1, 1, 0) = \tfrac{2}{8},
\end{aligned}$$

and

$$f_{Y_1,Y_2}(3, 0) = f_{X_1,X_2,X_3}(1, 1, 1) = \tfrac{1}{8}. \qquad ////$$

6.2 Continuous Random Variables 

Suppose now that we are given the joint probability density function
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$ of the $n$-dimensional continuous random variable
$(X_1, \ldots, X_n)$. Let

$$\mathfrak{X} = \{(x_1, \ldots, x_n): f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0\}. \qquad (38)$$

Again assume that the joint density of the random variables $Y_1 = g_1(X_1, \ldots, X_n),
\ldots, Y_k = g_k(X_1, \ldots, X_n)$ is desired, where $k$ is some integer satisfying $1 \le k \le n$.
If $k < n$, we will introduce additional, new random variables $Y_{k+1} =
g_{k+1}(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ for judiciously selected functions
$g_{k+1}, \ldots, g_n$; then we will find the joint distribution of $Y_1, \ldots, Y_n$, and finally
we will find the desired marginal distribution of $Y_1, \ldots, Y_k$ from the joint
distribution of $Y_1, \ldots, Y_n$. This use of possibly introducing additional random
variables makes the transformation $y_1 = g_1(x_1, \ldots, x_n), \ldots, y_n = g_n(x_1, \ldots, x_n)$
a transformation from an $n$-dimensional space to an $n$-dimensional space.
Henceforth we will assume that we are seeking the joint distribution of $Y_1 =
g_1(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ (rather than the joint distribution of
$Y_1, \ldots, Y_k$) when we are given the joint probability density of $X_1, \ldots, X_n$.

We will state our results first for $n = 2$ and later generalize to $n > 2$.
Let $f_{X_1,X_2}(x_1, x_2)$ be given. Set $\mathfrak{X} = \{(x_1, x_2): f_{X_1,X_2}(x_1, x_2) > 0\}$. We want to
find the joint distribution of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ for known
functions $g_1(\cdot, \cdot)$ and $g_2(\cdot, \cdot)$. Now suppose that $y_1 = g_1(x_1, x_2)$ and $y_2 =
g_2(x_1, x_2)$ defines a one-to-one transformation which maps $\mathfrak{X}$ onto, say, $\mathfrak{Y}$.
$x_1$ and $x_2$ can be expressed in terms of $y_1$ and $y_2$; so we can write, say, $x_1 =
g_1^{-1}(y_1, y_2)$ and $x_2 = g_2^{-1}(y_1, y_2)$. Note that $\mathfrak{X}$ is a subset of the $x_1 x_2$ plane and
$\mathfrak{Y}$ is a subset of the $y_1 y_2$ plane. The determinant

$$\begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex]
\dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2}
\end{vmatrix} \qquad (39)$$

will be called the Jacobian of the transformation and will be denoted by $J$.
The above discussion permits us to state Theorem 13.


Theorem 13 Let $X_1$ and $X_2$ be jointly continuous random variables
with density function $f_{X_1,X_2}(x_1, x_2)$. Set $\mathfrak{X} = \{(x_1, x_2): f_{X_1,X_2}(x_1, x_2) > 0\}$.
Assume that:

(i) $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one transformation
of $\mathfrak{X}$ onto $\mathfrak{Y}$.

(ii) The first partial derivatives of $x_1 = g_1^{-1}(y_1, y_2)$ and $x_2 = g_2^{-1}(y_1, y_2)$
are continuous over $\mathfrak{Y}$.

(iii) The Jacobian of the transformation is nonzero for $(y_1, y_2) \in \mathfrak{Y}$.

Then the joint density of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ is given
by

$$f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}(g_1^{-1}(y_1, y_2), g_2^{-1}(y_1, y_2))\, I_{\mathfrak{Y}}(y_1, y_2). \qquad (40)$$

PROOF We omit the proof; it is essentially the same as the derivation
of the formulas for transforming variables in double integrals, which may
be found in many advanced calculus textbooks. $\mathfrak{Y}$ is that subset of the
$y_1 y_2$ plane consisting of points $(y_1, y_2)$ for which there exists a $(x_1, x_2) \in \mathfrak{X}$
such that $(y_1, y_2) = (g_1(x_1, x_2), g_2(x_1, x_2))$.

$$I_{\mathfrak{Y}}(y_1, y_2) = I_{\mathfrak{X}}(g_1^{-1}(y_1, y_2), g_2^{-1}(y_1, y_2)). \qquad ////$$


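EXAMPLE 22 Suppose that $X_1$ and $X_2$ are independent random variables,
each uniformly distributed over the interval (0, 1). Then $f_{X_1,X_2}(x_1, x_2) =
I_{(0,1)}(x_1) I_{(0,1)}(x_2)$. $\mathfrak{X} = \{(x_1, x_2): 0 < x_1 < 1 \text{ and } 0 < x_2 < 1\}$. Let
$y_1 = g_1(x_1, x_2) = x_1 + x_2$ and $y_2 = g_2(x_1, x_2) = x_2 - x_1$; then $x_1 =
\tfrac{1}{2}(y_1 - y_2) = g_1^{-1}(y_1, y_2)$, and $x_2 = \tfrac{1}{2}(y_1 + y_2) = g_2^{-1}(y_1, y_2)$.

$$J = \begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex]
\dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2}
\end{vmatrix}
= \begin{vmatrix} \tfrac{1}{2} & -\tfrac{1}{2} \\[1ex] \tfrac{1}{2} & \tfrac{1}{2} \end{vmatrix} = \frac{1}{2}.$$

$\mathfrak{X}$ and $\mathfrak{Y}$ are sketched in Fig. 3. Note that the boundary $x_1 = 0$ of $\mathfrak{X}$ goes
into the boundary $\tfrac{1}{2}(y_1 - y_2) = 0$ of $\mathfrak{Y}$, the boundary $x_2 = 0$ of $\mathfrak{X}$ goes into
the boundary $\tfrac{1}{2}(y_1 + y_2) = 0$ of $\mathfrak{Y}$, the boundary $x_1 = 1$ of $\mathfrak{X}$ goes into the
boundary $\tfrac{1}{2}(y_1 - y_2) = 1$ of $\mathfrak{Y}$, and the boundary $x_2 = 1$ of $\mathfrak{X}$ goes into
the boundary $\tfrac{1}{2}(y_1 + y_2) = 1$ of $\mathfrak{Y}$. Now the transformation is one-to-one,
the first partial derivatives of $g_1^{-1}$ and $g_2^{-1}$ are continuous, and the Jacobian
is nonzero; so

$$f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}(g_1^{-1}(y_1, y_2), g_2^{-1}(y_1, y_2))\, I_{\mathfrak{Y}}(y_1, y_2)
= \begin{cases} \tfrac{1}{2} & \text{for } (y_1, y_2) \in \mathfrak{Y} \\ 0 & \text{otherwise.} \end{cases} \qquad ////$$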
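
The joint density of the sum and difference of two independent uniform
random variables can be written down and spot-checked as follows. This is a
sketch for illustration: the function names, the test box, and the seed are
assumptions. The density function applies the inverse transformation and the
constant Jacobian; the simulation estimates the probability of a small box
lying inside the image region, which should equal one-half times its area.

```python
import random


def joint_density_sum_diff(y1, y2):
    """Density of (Y1, Y2) = (X1 + X2, X2 - X1) for independent uniform(0, 1) X1, X2."""
    x1 = 0.5 * (y1 - y2)  # g1^{-1}(y1, y2)
    x2 = 0.5 * (y1 + y2)  # g2^{-1}(y1, y2)
    abs_jacobian = 0.5
    inside = 0.0 < x1 < 1.0 and 0.0 < x2 < 1.0  # indicator of the image region
    return abs_jacobian if inside else 0.0


def box_probability(lo1, hi1, lo2, hi2, reps=100_000, seed=2):
    """Monte Carlo estimate of P[lo1 < Y1 < hi1, lo2 < Y2 < hi2]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1, x2 = rng.random(), rng.random()
        if lo1 < x1 + x2 < hi1 and lo2 < x2 - x1 < hi2:
            hits += 1
    return hits / reps
```

The box $0.8 < y_1 < 1.2$, $-0.2 < y_2 < 0.2$ has area 0.16 and lies inside
the image region, so `box_probability(0.8, 1.2, -0.2, 0.2)` should be close
to 0.08.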


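EXAMPLE 23 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y_1 = X_1 + X_2$ and $Y_2 = X_1/X_2$. Then

$$x_1 = g_1^{-1}(y_1, y_2) = \frac{y_1 y_2}{1 + y_2} \qquad \text{and} \qquad x_2 = g_2^{-1}(y_1, y_2) = \frac{y_1}{1 + y_2}.$$

$$J = \begin{vmatrix}
\dfrac{y_2}{1 + y_2} & \dfrac{y_1}{(1 + y_2)^2} \\[2ex]
\dfrac{1}{1 + y_2} & -\dfrac{y_1}{(1 + y_2)^2}
\end{vmatrix}
= -\frac{y_1 (y_2 + 1)}{(1 + y_2)^3} = -\frac{y_1}{(1 + y_2)^2}.$$

$$\begin{aligned}
f_{Y_1,Y_2}(y_1, y_2) &= \frac{|y_1|}{(1 + y_2)^2} \cdot \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left[\frac{y_1^2 y_2^2}{(1 + y_2)^2} + \frac{y_1^2}{(1 + y_2)^2}\right]\right\} \\
&= \frac{1}{2\pi} \cdot \frac{|y_1|}{(1 + y_2)^2} \exp\left[-\frac{1}{2} \cdot \frac{(1 + y_2^2)\, y_1^2}{(1 + y_2)^2}\right].
\end{aligned}$$

To find the marginal distribution of, say, $Y_2$, we must integrate out $y_1$;
that is,

$$\begin{aligned}
f_{Y_2}(y_2) &= \int_{-\infty}^{\infty} f_{Y_1,Y_2}(y_1, y_2)\, dy_1 \\
&= \frac{1}{2\pi} \cdot \frac{1}{(1 + y_2)^2} \int_{-\infty}^{\infty} |y_1| \exp\left[-\frac{1}{2} \cdot \frac{(1 + y_2^2)\, y_1^2}{(1 + y_2)^2}\right] dy_1.
\end{aligned}$$

Let

$$u = \frac{1}{2} \cdot \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1^2;$$

then

$$du = \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1\, dy_1,$$

and so

$$f_{Y_2}(y_2) = \frac{1}{2\pi} \cdot \frac{1}{(1 + y_2)^2} \cdot \frac{2(1 + y_2)^2}{1 + y_2^2} \int_0^{\infty} e^{-u}\, du = \frac{1}{\pi} \cdot \frac{1}{1 + y_2^2},$$

a Cauchy density. That is, the ratio of two independent standard normal
random variables has a Cauchy distribution. ////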
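
The Cauchy result for the ratio of two independent standard normal random
variables can be checked by simulation against the Cauchy cumulative
distribution function $\tfrac{1}{2} + \arctan(t)/\pi$. The sketch below is
illustrative only; the function names, the number of replications, and the
seed are assumptions.

```python
import math
import random


def cauchy_cdf(t):
    """Cumulative distribution function of the density 1 / (pi * (1 + t^2))."""
    return 0.5 + math.atan(t) / math.pi


def empirical_ratio_cdf(t, reps=100_000, seed=3):
    """Fraction of simulated ratios X1/X2 that are at most t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if x2 != 0.0 and x1 / x2 <= t:  # x2 == 0 has probability zero
            hits += 1
    return hits / reps
```

For example, `cauchy_cdf(1.0)` is 0.75, and `empirical_ratio_cdf(1.0)` should
be close to that value.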


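EXAMPLE 24 Let $X_i$ have a gamma density with parameters $n_i$ and $\lambda$ for
$i = 1, 2$. Assume that $X_1$ and $X_2$ are independent. Again, we seek the
joint distribution of $Y_1 = X_1 + X_2$ and $Y_2 = X_1/X_2$.

$$x_1 = g_1^{-1}(y_1, y_2) = \frac{y_1 y_2}{1 + y_2} \qquad \text{and} \qquad x_2 = g_2^{-1}(y_1, y_2) = \frac{y_1}{1 + y_2};$$

hence

$$|J| = \frac{y_1}{(1 + y_2)^2},$$

$$\begin{aligned}
f_{Y_1,Y_2}(y_1, y_2) &= \frac{y_1}{(1 + y_2)^2} \cdot \frac{\lambda^{n_1+n_2}}{\Gamma(n_1)\Gamma(n_2)} \left(\frac{y_1 y_2}{1 + y_2}\right)^{n_1-1} \left(\frac{y_1}{1 + y_2}\right)^{n_2-1} e^{-\lambda y_1} I_{(0,\infty)}(y_1) I_{(0,\infty)}(y_2) \\
&= \frac{\lambda^{n_1+n_2}}{\Gamma(n_1)\Gamma(n_2)}\, y_1^{n_1+n_2-1} e^{-\lambda y_1}\, \frac{y_2^{n_1-1}}{(1 + y_2)^{n_1+n_2}}\, I_{(0,\infty)}(y_1) I_{(0,\infty)}(y_2) \\
&= \left[\frac{\lambda^{n_1+n_2}}{\Gamma(n_1+n_2)}\, y_1^{n_1+n_2-1} e^{-\lambda y_1} I_{(0,\infty)}(y_1)\right] \left[\frac{1}{B(n_1, n_2)} \cdot \frac{y_2^{n_1-1}}{(1 + y_2)^{n_1+n_2}}\, I_{(0,\infty)}(y_2)\right].
\end{aligned}$$

We see that $f_{Y_1,Y_2}(y_1, y_2) = f_{Y_1}(y_1) f_{Y_2}(y_2)$, so $Y_1$ and $Y_2$ are independent.
Also, we see that the distribution of $Y_1 = X_1 + X_2$ is a gamma distribution
with parameters $n_1 + n_2$ and $\lambda$. If $n_1 = n_2 = 1$, then $Y_2$ is the ratio
of two independent exponentially distributed random variables and has
density

$$f_{Y_2}(y_2) = \frac{1}{(1 + y_2)^2}\, I_{(0,\infty)}(y_2),$$

a density which has an infinite mean. ////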


EXAMPLE 25 Let $X_i$ have a gamma distribution with parameters $n_i$
and $\lambda$ for $i = 1, 2$, and assume $X_1$ and $X_2$ are independent. Suppose now
that the distribution of $Y_1 = X_1/(X_1 + X_2)$ is desired. We have only
the one function $y_1 = g_1(x_1, x_2) = x_1/(x_1 + x_2)$; so we have to select the
other to use the transformation technique. Since $x_1$ and $x_2$ occur in the
exponent of their joint density as their sum, $x_1 + x_2$ is a good choice.
Let $y_2 = x_1 + x_2$; then $x_1 = y_1 y_2$, $x_2 = y_2 - y_1 y_2$, and

$$J = \begin{vmatrix} y_2 & y_1 \\ -y_2 & 1 - y_1 \end{vmatrix} = y_2.$$

Hence

$$\begin{aligned}
f_{Y_1,Y_2}(y_1, y_2) &= y_2\, \frac{\lambda^{n_1+n_2}}{\Gamma(n_1)\Gamma(n_2)}\, (y_1 y_2)^{n_1-1} (y_2 - y_1 y_2)^{n_2-1} e^{-\lambda y_2}\, I_{(0,1)}(y_1) I_{(0,\infty)}(y_2) \\
&= \frac{\lambda^{n_1+n_2}}{\Gamma(n_1)\Gamma(n_2)}\, y_1^{n_1-1} (1 - y_1)^{n_2-1}\, y_2^{n_1+n_2-1} e^{-\lambda y_2}\, I_{(0,1)}(y_1) I_{(0,\infty)}(y_2) \\
&= \left[\frac{1}{B(n_1, n_2)}\, y_1^{n_1-1} (1 - y_1)^{n_2-1} I_{(0,1)}(y_1)\right] \left[\frac{\lambda^{n_1+n_2}}{\Gamma(n_1+n_2)}\, y_2^{n_1+n_2-1} e^{-\lambda y_2} I_{(0,\infty)}(y_2)\right].
\end{aligned}$$

It turns out that $Y_1$ and $Y_2$ are independent and $Y_1$ has a beta distribution
with parameters $n_1$ and $n_2$. ////
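
A simulation gives a quick plausibility check of both conclusions: that
$X_1/(X_1 + X_2)$ behaves like a beta random variable and that it is
uncorrelated with $X_1 + X_2$, as independence requires. The sketch below is
an illustration; the function names, parameter values, and seed are
assumptions. Note that `random.gammavariate` takes a shape and a scale, so
the scale is set to $1/\lambda$.

```python
import random
import statistics


def simulate_beta_from_gammas(n1, n2, lam, reps=50_000, seed=6):
    """Draw (Y1, Y2) = (X1/(X1+X2), X1+X2) with independent X_i ~ gamma(n_i, lam)."""
    rng = random.Random(seed)
    y1s, y2s = [], []
    for _ in range(reps):
        x1 = rng.gammavariate(n1, 1.0 / lam)  # (shape, scale)
        x2 = rng.gammavariate(n2, 1.0 / lam)
        y1s.append(x1 / (x1 + x2))
        y2s.append(x1 + x2)
    return y1s, y2s


def sample_correlation(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5
```

For $n_1 = 2$, $n_2 = 3$ the beta distribution has mean $2/5$ and variance
$0.04$, and the sum has mean $(n_1 + n_2)/\lambda$; the simulated values
should be close to these, with a sample correlation near zero. A zero
correlation is of course only a necessary consequence of independence, not a
proof of it.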


Of the three conditions that are imposed on the transformation $y_1 =
g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$, the sometimes restrictive condition that the
transformation be one-to-one can be relaxed. For the transformation $y_1 =
g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$, each point in $\mathfrak{X}$ will correspond to just one point
in $\mathfrak{Y}$; but to a point in $\mathfrak{Y}$ there may correspond more than one point in $\mathfrak{X}$, which
says that the transformation is not one-to-one and consequently Theorem 13 as
stated is not applicable. If, however, $\mathfrak{X}$ can be decomposed into a finite number
of disjoint sets, say $\mathfrak{X}_1, \ldots, \mathfrak{X}_m$, so that $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$
define a one-to-one transformation of each $\mathfrak{X}_i$ onto $\mathfrak{Y}$, then the joint density of
$Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ can be found. Let $x_1 = g_{1i}^{-1}(y_1, y_2)$
and $x_2 = g_{2i}^{-1}(y_1, y_2)$ denote the inverse transformation of $\mathfrak{Y}$ onto $\mathfrak{X}_i$ for $i = 1,
\ldots, m$, and set

$$J_i = \begin{vmatrix}
\dfrac{\partial g_{1i}^{-1}}{\partial y_1} & \dfrac{\partial g_{1i}^{-1}}{\partial y_2} \\[2ex]
\dfrac{\partial g_{2i}^{-1}}{\partial y_1} & \dfrac{\partial g_{2i}^{-1}}{\partial y_2}
\end{vmatrix}.$$

Theorem 14 Let $X_1$ and $X_2$ be two jointly continuous random variables
with density function $f_{X_1,X_2}(x_1, x_2)$. Assume that $\mathfrak{X}$ can be decomposed
into sets $\mathfrak{X}_1, \ldots, \mathfrak{X}_m$ such that the transformation $y_1 = g_1(x_1, x_2)$ and
$y_2 = g_2(x_1, x_2)$ is one-to-one from $\mathfrak{X}_i$ onto $\mathfrak{Y}$. Let $x_1 = g_{1i}^{-1}(y_1, y_2)$ and
$x_2 = g_{2i}^{-1}(y_1, y_2)$ denote the inverse transformation of $\mathfrak{Y}$ onto $\mathfrak{X}_i$, $i =
1, \ldots, m$. Assume that all first partial derivatives of $g_{1i}^{-1}$ and $g_{2i}^{-1}$ are
continuous on $\mathfrak{Y}$ and that $J_i$ does not vanish on $\mathfrak{Y}$, $i = 1, \ldots, m$. Then

$$f_{Y_1,Y_2}(y_1, y_2) = \sum_{i=1}^{m} |J_i|\, f_{X_1,X_2}(g_{1i}^{-1}(y_1, y_2), g_{2i}^{-1}(y_1, y_2))\, I_{\mathfrak{Y}}(y_1, y_2). \qquad (41)$$

////

We illustrate this theorem with Example 26.


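EXAMPLE 26 Assume that $X_1$ and $X_2$ are independent standard normal
random variables. Consider the transformation $y_1 = x_1^2 + x_2^2$ and
$y_2 = x_2$, which implies $x_1 = \pm\sqrt{y_1 - y_2^2}$ and $x_2 = y_2$ so that the
transformation is not one-to-one. Here $\mathfrak{X} = \{(x_1, x_2): -\infty < x_1 < \infty,
-\infty < x_2 < \infty\}$, and $\mathfrak{Y} = \{(y_1, y_2): 0 < y_1 < \infty, -\sqrt{y_1} < y_2 < \sqrt{y_1}\}$.
If $\mathfrak{X}$ is decomposed into $\mathfrak{X}_1$ and $\mathfrak{X}_2$, where $\mathfrak{X}_1 = \{(x_1, x_2): 0 \le x_1 < \infty,
-\infty < x_2 < \infty\}$ and $\mathfrak{X}_2 = \{(x_1, x_2): -\infty < x_1 < 0, -\infty < x_2 < \infty\}$ (in
the terminology of Theorem 14, $m = 2$), then our transformation is
one-to-one for $\mathfrak{X}_i$ onto $\mathfrak{Y}$, $i = 1, 2$. $g_{11}^{-1}(y_1, y_2) = \sqrt{y_1 - y_2^2}$, and $g_{21}^{-1}(y_1, y_2) =
y_2$, $g_{12}^{-1}(y_1, y_2) = -\sqrt{y_1 - y_2^2}$, and $g_{22}^{-1}(y_1, y_2) = y_2$; so

$$J_1 = \begin{vmatrix}
\tfrac{1}{2}(y_1 - y_2^2)^{-1/2} & \dfrac{\partial g_{11}^{-1}}{\partial y_2} \\[2ex]
0 & 1
\end{vmatrix} = \tfrac{1}{2}(y_1 - y_2^2)^{-1/2},$$

and

$$J_2 = \begin{vmatrix}
-\tfrac{1}{2}(y_1 - y_2^2)^{-1/2} & \dfrac{\partial g_{12}^{-1}}{\partial y_2} \\[2ex]
0 & 1
\end{vmatrix} = -\tfrac{1}{2}(y_1 - y_2^2)^{-1/2}.$$

Hence,

$$\begin{aligned}
f_{Y_1,Y_2}(y_1, y_2) &= \left[|J_1|\, f_{X_1,X_2}(g_{11}^{-1}(y_1, y_2), g_{21}^{-1}(y_1, y_2)) + |J_2|\, f_{X_1,X_2}(g_{12}^{-1}(y_1, y_2), g_{22}^{-1}(y_1, y_2))\right] I_{\mathfrak{Y}}(y_1, y_2) \\
&= \frac{1}{\sqrt{y_1 - y_2^2}} \cdot \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1}
\end{aligned}$$

for $y_1 > 0$ and $-\sqrt{y_1} < y_2 < \sqrt{y_1}$. Now

$$\begin{aligned}
f_{Y_1}(y_1) &= \int_{-\infty}^{\infty} f_{Y_1,Y_2}(y_1, y_2)\, dy_2 = \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1} \int_{-\sqrt{y_1}}^{\sqrt{y_1}} \frac{dy_2}{\sqrt{y_1 - y_2^2}} \\
&= \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1} \left[\arcsin \frac{y_2}{\sqrt{y_1}}\right]_{-\sqrt{y_1}}^{\sqrt{y_1}} = \frac{1}{2}\, e^{-\frac{1}{2} y_1} \qquad \text{for } y_1 > 0,
\end{aligned}$$

an exponential distribution. ////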
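
The conclusion that $X_1^2 + X_2^2$ is exponential with parameter
$\tfrac{1}{2}$ can be checked by comparing a simulated cumulative
distribution function with $1 - e^{-y/2}$. The sketch below is illustrative;
the function names, the number of replications, and the seed are assumptions.

```python
import math
import random


def empirical_cdf_sum_of_squares(y, reps=100_000, seed=5):
    """Fraction of simulated values of X1^2 + X2^2 that are at most y."""
    rng = random.Random(seed)
    hits = sum(
        rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2 <= y
        for _ in range(reps)
    )
    return hits / reps


def exponential_cdf(y, rate=0.5):
    """Cumulative distribution function of the exponential density rate * exp(-rate * y)."""
    return 1.0 - math.exp(-rate * y) if y > 0 else 0.0
```

At $y = 2$, for instance, `exponential_cdf(2.0)` is $1 - e^{-1}$, and the
simulated fraction should agree with it to within sampling error.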
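Theorems 13 and 14 can be generalized from $n = 2$ to $n > 2$. We will
state the generalization of Theorem 14. (Theorem 13 is a special case of Theorem
14.)

Theorem 15 Let $X_1, X_2, \ldots, X_n$ be jointly continuous random variables
with density function $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$. Let $\mathfrak{X} = \{(x_1, \ldots, x_n):
f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0\}$. Assume that $\mathfrak{X}$ can be decomposed into sets
$\mathfrak{X}_1, \ldots, \mathfrak{X}_m$ such that $y_1 = g_1(x_1, \ldots, x_n)$, $y_2 = g_2(x_1, \ldots, x_n)$, $\ldots$, $y_n =
g_n(x_1, \ldots, x_n)$ is a one-to-one transformation of $\mathfrak{X}_i$ onto $\mathfrak{Y}$, $i = 1, \ldots, m$.
Let $x_1 = g_{1i}^{-1}(y_1, \ldots, y_n), \ldots, x_n = g_{ni}^{-1}(y_1, \ldots, y_n)$ denote the inverse
transformation of $\mathfrak{Y}$ onto $\mathfrak{X}_i$, $i = 1, \ldots, m$. Define

$$J_i = \begin{vmatrix}
\dfrac{\partial g_{1i}^{-1}}{\partial y_1} & \dfrac{\partial g_{1i}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{1i}^{-1}}{\partial y_n} \\[2ex]
\dfrac{\partial g_{2i}^{-1}}{\partial y_1} & \dfrac{\partial g_{2i}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{2i}^{-1}}{\partial y_n} \\[1ex]
\vdots & \vdots & & \vdots \\[1ex]
\dfrac{\partial g_{ni}^{-1}}{\partial y_1} & \dfrac{\partial g_{ni}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{ni}^{-1}}{\partial y_n}
\end{vmatrix}$$

for $i = 1, \ldots, m$.

Assume that all the partial derivatives in $J_i$ are continuous over
$\mathfrak{Y}$ and the determinant $J_i$ is nonzero, $i = 1, \ldots, m$. Then

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \sum_{i=1}^{m} |J_i|\, f_{X_1, \ldots, X_n}(g_{1i}^{-1}(y_1, \ldots, y_n), \ldots, g_{ni}^{-1}(y_1, \ldots, y_n)) \qquad (42)$$

for $(y_1, \ldots, y_n)$ in $\mathfrak{Y}$. ////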


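EXAMPLE 27 Let $X_1$, $X_2$, and $X_3$ be independent standard normal random
variables, $y_1 = x_1$, $y_2 = (x_1 + x_2)/2$, and $y_3 = (x_1 + x_2 + x_3)/3$. Then
$x_1 = y_1$, $x_2 = 2y_2 - y_1$, and $x_3 = 3y_3 - 2y_2$; so the transformation is
one-to-one. ($m = 1$ in Theorem 15.)

$$J = \begin{vmatrix} 1 & 0 & 0 \\ -1 & 2 & 0 \\ 0 & -2 & 3 \end{vmatrix} = 6,$$

$$\begin{aligned}
f_{Y_1,Y_2,Y_3}(y_1, y_2, y_3) &= |J|\, f_{X_1,X_2,X_3}(x_1, x_2, x_3) \\
&= 6\left(\frac{1}{\sqrt{2\pi}}\right)^3 \exp\left\{-\tfrac{1}{2}\left[y_1^2 + (2y_2 - y_1)^2 + (3y_3 - 2y_2)^2\right]\right\} \\
&= 6\left(\frac{1}{\sqrt{2\pi}}\right)^3 \exp\left[-\tfrac{1}{2}\left(2y_1^2 - 4y_1 y_2 + 8y_2^2 - 12 y_2 y_3 + 9y_3^2\right)\right].
\end{aligned}$$

The marginal distributions can be obtained from the joint distribution;
for instance,

$$\begin{aligned}
f_{Y_3}(y_3) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{Y_1,Y_2,Y_3}(y_1, y_2, y_3)\, dy_1\, dy_2 \\
&= 6\left(\frac{1}{\sqrt{2\pi}}\right)^3 \int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}\left(6y_2^2 - 12 y_2 y_3 + 9y_3^2\right)\right] \left(\int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}\left(2y_1^2 - 4y_1 y_2 + 2y_2^2\right)\right] dy_1\right) dy_2 \\
&= \frac{6}{\sqrt{2}}\left(\frac{1}{\sqrt{2\pi}}\right)^2 \int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}\left(6y_2^2 - 12 y_2 y_3 + 6y_3^2\right)\right] \exp\left[-\tfrac{1}{2}\left(3y_3^2\right)\right] dy_2 \\
&= \frac{\sqrt{3}}{\sqrt{2\pi}} \exp\left[-\tfrac{3}{2}\, y_3^2\right];
\end{aligned}$$

that is, $Y_3$ is normally distributed with mean 0 and variance $\tfrac{1}{3}$. ////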
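
For a linear transformation such as this one the Jacobian is a constant
determinant, and the variance of $Y_3$ also follows directly from the rule
for linear combinations of independent random variables. The sketch below
checks both with exact arithmetic; the function names are assumptions made
for illustration.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Row j holds (dx_j/dy_1, dx_j/dy_2, dx_j/dy_3) for
# x1 = y1, x2 = 2*y2 - y1, x3 = 3*y3 - 2*y2.
jacobian = [[1, 0, 0],
            [-1, 2, 0],
            [0, -2, 3]]


def variance_of_linear_form(coeffs):
    """var[sum c_i X_i] for independent X_i, each with unit variance."""
    return sum(Fraction(c) ** 2 for c in coeffs)


third = Fraction(1, 3)
var_y3 = variance_of_linear_form([third, third, third])
```

Here `det3(jacobian)` is 6 and `var_y3` is $\tfrac{1}{3}$, in agreement with
the marginal density obtained by integration.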


PROBLEMS 

1 (a) Let $X_1$, $X_2$, and $X_3$ be uncorrelated random variables with common variance
$\sigma^2$. Find the correlation coefficient between $X_1 + X_2$ and $X_2 + X_3$.

(b) Let $X_1$ and $X_2$ be uncorrelated random variables. Find the correlation
coefficient between $X_1 + X_2$ and $X_2 - X_1$ in terms of var $[X_1]$ and var $[X_2]$.

(c) Let $X_1$, $X_2$, and $X_3$ be independently distributed random variables with
common mean $\mu$ and common variance $\sigma^2$. Find the correlation coefficient
between $X_2 - X_1$ and $X_3 - X_2$.

2 Prove Theorem 2.

3 Let $X$ have c.d.f. $F_X(\cdot) = F(\cdot)$. What in terms of $F(\cdot)$ is the distribution of
$X I_{[0,\infty)}(X) = \max[0, X]$?

4 Consider drawing balls, one at a time, without replacement, from an urn containing
$M$ balls, $K$ of which are defective. Let the random variable $X$ ($Y$) denote the
number of the draw on which the first defective (nondefective) ball is obtained.
Let $Z$ denote the number of the draw on which the $r$th defective ball is obtained.
(a) Find the distribution of $X$.

(b) Find the distribution of $Z$. (Such distribution is often called the negative
hypergeometric distribution.)

(c) Set $M = 5$ and $K = 2$. Find the joint distribution of $X$ and $Y$.

5 Let $X_1, \ldots, X_n$ be independent and identically distributed with common density

$$f_X(x) = x^{-2} I_{[1,\infty)}(x).$$

Set $Y = \min[X_1, \ldots, X_n]$. Does $\mathcal{E}[X_1]$ exist? If so, find it. Does $\mathcal{E}[Y]$ exist?
If so, find it.

6 Let X and Y be two random variables having finite means. 

(a) Prove or disprove: $ [max [X, F]] > max [£[X], S'[ Y]]. 

(b) Prove or disprove: <f [max [X, Y] + min [X, y]] = <f[X] + <?[ Y\. 

7 The area of a rectangle is obtained by first measuring the length and width and 
then multiplying the two measurements together. Let X denote the measured 
length, y the measured width. Assume that the measurements X and Y are 
random variables with joint probability density function given by /*. Y (x, y) = 
kIi. 9 L . 1.1 £.](*)/[. a h\ i. 2 wj(y), where L and W are parameters satisfying L>W>Q 
and A: is a constant which may depend on L and W. 

(a) Find X[XY] and var [XY], 

(b) Find the distribution of XY. 

8 If X and Y are independent random variables with (negative) exponential dis¬ 
tributions having respective parameters A, and A 2 , find <f[max [X, y]]. 

9 Projectiles are fired at the origin of an xy coordinate system. Assume that the 
point which is hit, say $(X, Y)$, consists of a pair of independent standard normal
random variables. For two projectiles fired independently of one another, let
$(X_1, Y_1)$ and $(X_2, Y_2)$ represent the points which are hit, and let $Z$ be the distance
between them. Find the distribution of $Z^2$. Hint: What is the distribution of
$(X_2 - X_1)^2$? Of $(Y_2 - Y_1)^2$? Is $(X_2 - X_1)^2$ independent of $(Y_2 - Y_1)^2$?

10 A certain explosive device will detonate if any one of $n$ short-lived fuses lasts
longer than .8 seconds. Let $X_i$ represent the life of the $i$th fuse. It can be assumed
that each $X_i$ is uniformly distributed over the interval 0 to 1 second. Furthermore
it can be assumed that the $X_i$'s are independent.

(a) How many fuses are needed (i.e., how large should n be) if one wants to be 
95 percent certain that the device will detonate? 

(b) If the device has nine fuses, what is the average life of the fuse that lasts the
longest? 

11 Suppose that random variable $X_n$ has a c.d.f. given by $[(n - 1)/n]\Phi(x) + (1/n)F_n(x)$,
where $\Phi(\cdot)$ is the c.d.f. of a standard normal and for each $n$, $F_n(\cdot)$ is a c.d.f. What
is the limiting distribution of $X_n$?

12 Let $X$ and $Y$ be independent random variables each having a geometric
distribution.

*(a) Find the distribution of $X/(X + Y)$. [Define $X/(X + Y)$ to be zero if
$X + Y = 0$.]

(b) Find the joint moment generating function of $X$ and $X + Y$.

13 Let $X_1$ and $X_2$ be independent standard normal random variables. Let
$Y_1 = X_1 + X_2$ and $Y_2 = X_1^2 + X_2^2$.

(a) Show that the joint moment generating function of $Y_1$ and $Y_2$ is

$$\frac{\exp[t_1^2/(1 - 2t_2)]}{1 - 2t_2}$$

for $-\infty < t_1 < \infty$ and $-\infty < t_2 < \frac{1}{2}$.

(b) Find the correlation coefficient of $Y_1$ and $Y_2$.

14 Let $X$ and $Y$ be independent standard normal random variables. Find the m.g.f.
of $XY$.

15 Suppose that $X_1$ and $X_2$ are independent random variables, each having a standard
normal distribution.

(a) Find the joint distribution of $(X_1 + X_2)/\sqrt{2}$ and $(X_2 - X_1)/\sqrt{2}$.

(b) Argue that $2X_1X_2$ and $X_2^2 - X_1^2$ have the same distribution. Hint:

$$X_2^2 - X_1^2 = 2 \cdot \frac{X_1 + X_2}{\sqrt{2}} \cdot \frac{X_2 - X_1}{\sqrt{2}}.$$

16 A dry-bean supplier fills bean bags with a machine that does not work very well,
and he advertises that each bag contains 1 pound of beans. In fact, the weight
of the beans that the machine puts into a bag is a random variable with mean
16 ounces and standard deviation 1 ounce. If a box contains 16 bags of beans:

(a) Find the mean and variance of the weight of the beans in a box.

(b) Find approximately the probability that the weight of the beans in a box
exceeds 250 ounces.

(c) Find the probability that two or fewer underweight (less than 16 ounces) bags
are in the box if the weight of beans in a bag is assumed to be normally
distributed.

17 Numbers are selected at random from the interval (0, 1).

(a) If 10 numbers are selected, what is the probability that exactly 5 are less than
$\frac{1}{2}$?

(b) If 10 numbers are selected, on the average how many are less than $\frac{1}{2}$?

(c) If 100 numbers are selected, what is the probability that the average of the
numbers is less than $\frac{1}{2}$?

18 Let $X_i$ denote the number of meteors that collide with a test satellite during the
$i$th orbit. Let $S_n = \sum_{i=1}^{n} X_i$; that is, $S_n$ is the total number of meteors that collide
with the satellite during $n$ orbits. Assume that the $X_i$'s are independent and
identically distributed Poisson random variables having mean $\lambda$.

(a) Find $\mathcal{E}[S_n]$ and var $[S_n]$.

(b) If $n = 100$ and $\lambda = 4$, find approximately $P[S_{100} > 440]$.

19 How many light bulbs should you buy if you want to be 95 percent certain that 
you will have 1000 hours of light if each of the bulbs is known to have a lifetime 
that is (negative) exponentially distributed with an average life of 100 hours? 

(a) Assume that all the bulbs are burning simultaneously. 

(b) Assume that one bulb is used until it burns out and then it is replaced, etc.

20 (a) If $X_1, \ldots, X_n$ are independent and identically distributed gamma random
variables, what is the distribution of $X_1 + \cdots + X_n$?

(b) If $X_1, \ldots, X_n$ are independent gamma random variables and if $X_i$ has parameters
$r_i$ and $\lambda$, $i = 1, \ldots, n$, what is the distribution of $X_1 + \cdots + X_n$?

21 (a) If $X_1, \ldots, X_n$ are independent identically distributed geometric random
variables, what is the distribution of $X_1 + \cdots + X_n$?

(b) If $X_1, \ldots, X_n$ are independent identically distributed geometric random
variables with density $\theta(1 - \theta)^{x-1} I_{\{1, 2, \ldots\}}(x)$, what is the distribution of
$X_1 + \cdots + X_n$?

(c) If $X_1, \ldots, X_n$ are independent identically distributed negative binomial
random variables, what is the distribution of $X_1 + \cdots + X_n$?

(d) If $X_1, \ldots, X_n$ are independent negative binomial random variables and if $X_i$
has parameters $r_i$ and $p$, what is the distribution of $X_1 + \cdots + X_n$?

*22 Kitty Oil Co. has decided to drill for oil in 10 different locations; the cost of 
drilling at each location is $10,000. (Total cost is then $100,000.) The prob¬ 
ability of finding oil in a given location is only |, but if oil is found at a given 
location, then the amount of money the company will get selling oil (excluding the 
initial $10,000 drilling cost) from that location is an exponential random variable 
with mean $50,000. Let Y be the random variable that denotes the number of 
locations where oil is found, and let Z denote the total amount of money received 
from selling oil from all the locations. 

(a) Find $\mathcal{E}[Z]$.

(b) Find $P[Z > 100,000 \mid Y = 1]$ and $P[Z > 100,000 \mid Y = 2]$.

(c) How would you find $P[Z > 100,000]$? Is $P[Z > 100,000] > \frac{1}{2}$?

23 If $X_1, \ldots, X_k$ are independent Poisson distributed random variables, show that
the conditional distribution of $X_1$, given $X_1 + \cdots + X_k$, is binomial.

*24 Assume that $X_1, \ldots, X_{k+1}$ are independent Poisson distributed random variables
with respective parameters $\lambda_1, \ldots, \lambda_{k+1}$. Show that the conditional distribution
of $X_1, \ldots, X_k$ given that $X_1 + \cdots + X_{k+1} = n$ has a multinomial distribution
with parameters $n, \lambda_1/\lambda, \ldots, \lambda_k/\lambda$, where $\lambda = \lambda_1 + \cdots + \lambda_{k+1}$.

25 If $X$ has a uniform distribution over the interval $(-\pi/2, \pi/2)$, find the distribution
of $Y = \tan X$.

26 If $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, find the distribution,
mean, and variance of $Y = e^X$.

27 Suppose $X$ has c.d.f. $F_X(x) = \exp[-e^{-(x-\alpha)/\beta}]$. What is the distribution of
$Y = \exp[-(X - \alpha)/\beta]$?

28 Let $X$ have density

$$f_X(x; a, b) = \frac{1}{B(a, b)} \frac{x^{a-1}}{(1 + x)^{a+b}} I_{(0,\infty)}(x),$$

where $a > 0$ and $b > 0$. (This density is often called a beta distribution of the
second kind.) Find the distribution of $Y = 1/(1 + X)$.

29 If $X$ has a uniform distribution on the interval (0, 1), find the distribution of $1/X$.
Does $\mathcal{E}[1/X]$ exist? If so, find it.

30 (a) Give an example of a distribution of a random variable $X$ for which $\mathcal{E}[1/X]$
is not finite.

(b) Give an example of a distribution of a random variable $X$ for which $\mathcal{E}[1/X]$ is
finite, and evaluate $\mathcal{E}[1/X]$.

31 If $f_X(x) = 2xe^{-x^2} I_{(0,\infty)}(x)$, find the density of $Y = X^2$.

32 If $X$ has a beta distribution, what is the distribution of $1 - X$?

33 If $f_X(x) = e^{-x} I_{(0,\infty)}(x)$, find the distribution of $X/(1 + X)$.

34 If $f_X(x) = 1/[\pi(1 + x^2)]$, find the distribution of $1/X$.

35 If $f_X(x) = 0$ for $x < 0$, find the density of $Y = aX^2 + b$ in terms of $f_X(\cdot)$ for $a > 0$.

36 If $X$ has the Weibull distribution as given in Eq. (42) of Chap. III, what is the
distribution of $Y = aX^b$?

37 (a) Let Y = X 2 and fx(x) = e) (x), 0> 0. Find the c.d.f. of X and Y. 

Find the density of Y. 

(b) Let Y = X 2 and fxfx) = e fx), 6> 0. Find the c.d.f. and density 

of y. 

38 If X and Y are independent random variables, each having the same geometric 
distribution, find the distribution of Y — X. 

39 If X and Y are independent random variables, each having the same negative 
exponential distribution, find the distribution of Y — X. 

40 If $X$, $Y$, and $Z$ are independent random variables, each uniformly distributed over
(0, 1), what is the distribution of $XY/Z$?

41 Assume that $X$ and $Y$ are independent random variables, where $X$ has a p.d.f.
given by $f_X(x) = 2x I_{(0,1)}(x)$ and $Y$ has a p.d.f. given by $f_Y(y) = 2(1 - y) I_{(0,1)}(y)$.
Find the distribution of $X + Y$.

*42 Let X and Y be independent Poisson distributed random variables. Find the 
distribution of Y — X. 

43 If $f_X(x) = I_{(0,1)}(x)$, find the density of $Y = 3X + 1$.

*44 Let X and Y be two independent beta-distributed random variables. Is XY always 
beta-distributed? If not, find conditions on the parameters of X and Y that will 
imply that XY is beta-distributed. 

45 If $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$, find the density of $Z = (X + Y)/2$.

46 If $f_{X,Y}(x, y) = 4xye^{-(x^2+y^2)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$, find the density of $\sqrt{X^2 + Y^2}$.

47 If $f_{X,Y}(x, y) = 4xy I_{(0,1)}(x) I_{(0,1)}(y)$, find the joint density of $X^2$ and $Y^2$.

48 If $f_{X,Y}(x, y) = 3x I_{(0,x)}(y) I_{(0,1)}(x)$, find the density of $Z = X - Y$.

49 If $f_X(x) = [(1 + x)/2] I_{(-1,1)}(x)$, find the density of $Y = X^2$.

50 If $f_{X,Y}(x, y) = I_{(0,1)}(x) I_{(0,1)}(y)$, find the density of $Z$, where

$$Z = (X + Y) I_{(-\infty,1]}(X + Y) + (X + Y - 1) I_{(1,\infty)}(X + Y).$$

51 If $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$, find the joint density of $X$ and $X + Y$.

52 If $f_{X,Y,Z}(x, y, z) = e^{-(x+y+z)} I_{(0,\infty)}(x) I_{(0,\infty)}(y) I_{(0,\infty)}(z)$, find the density of their
average $(X + Y + Z)/3$.

53 If $X_1$ and $X_2$ are independent and each has probability density given by
$\lambda e^{-\lambda x} I_{(0,\infty)}(x)$, find the joint distribution of $Y_1 = X_1/X_2$ and $Y_2 = X_1 + X_2$ and
the marginal distributions of $Y_1$ and $Y_2$.

*54 Let $X_1$ and $X_2$ be independent random variables, each normally distributed with
parameters $\mu = 0$ and $\sigma^2 = 1$. Find the joint distribution of $Y_1 = X_1^2 + X_2^2$ and
$Y_2 = X_1/X_2$. Find the marginal distribution of $Y_1$ and of $Y_2$. Are $Y_1$ and $Y_2$
independent?

55 If the joint distribution of $X$ and $Y$ is given by

$$f_{X,Y}(x, y) = 2e^{-(x+y)} I_{[0,y]}(x) I_{[0,\infty)}(y),$$

find the joint distribution of $X$ and $X + Y$. Find the marginal distributions of
$X$ and $X + Y$.

56 Let $f_{X,Y}(x, y) = K(x + y) I_{(0,1)}(x) I_{(0,1)}(y) I_{(0,1)}(x + y)$.

(a) Find $f_X(\cdot)$.

(b) Find the joint and marginal distributions of $X + Y$ and $Y - X$.

57 Suppose $f_{X,Y|Z}(x, y|z) = [z + (1 - z)(x + y)] I_{(0,1)}(x) I_{(0,1)}(y)$ for $0 \le z \le 2$, and
$f_Z(z) = \frac{1}{2} I_{[0,2]}(z)$.

(a) Find $\mathcal{E}[X + Y]$.

(b) Are $X$ and $Y$ independent? Verify.

(c) Are $X$ and $Z$ independent? Verify.

(d) Find the joint distribution of $X$ and $X + Y$.

(e) Find the distribution of $\max[X, Y] \mid Z = z$.

(f) Find the distribution of $(X + Y) \mid Z = z$.

58 A system will function as long as at least one of three components functions.
When all three components are functioning, the distribution of the life of each is
exponential with parameter $\frac{1}{3}\lambda$. When only two are functioning, the distribution
of the life of each of the two is exponential with parameter $\frac{1}{2}\lambda$; and when only one
is functioning, the distribution of its life is exponential with parameter $\lambda$.

(a) What is the distribution of the lifetime of the system?

(b) Suppose now that only one component (of the three components) is used at a
time and it is replaced when it fails. What is the distribution of the lifetime
of such a system?

59 The system in the sketch will function as long as component $C_1$ and at least one
of the components $C_2$ and $C_3$ functions. Let $X_i$ be the random variable denoting
the lifetime of component $C_i$, $i = 1, 2$, and 3. Let $Y = \max[X_2, X_3]$ and
$Z = \min[X_1, Y]$. Assume that the $X_i$'s are independent (negative) exponential
random variables with mean 1.

(a) Find $\mathcal{E}[Z]$ and var $[Z]$.

(b) Find the distribution of the lifetime of the system.









60 A system, which is composed of two components, will function as long as at least
one of the two components functions. When both components are operating,
the lifetime distribution of each is exponential with mean 1. However, the
distribution of the remaining lifetime of the good component, after one fails, is
exponential with mean $\frac{1}{2}$. (The idea is that after one component fails the other
component carries twice the load and hence has only half the expected lifetime.)
Find the lifetime distribution of the system.

*61 Suppose that $(X, Y)$ has a bivariate normal distribution. Find the joint
distribution of $aX + bY$ and $cX + dY$ for constants $a$, $b$, $c$, and $d$ satisfying $ad - bc \ne 0$.
Find the distribution of $aX + bY$. Hint: Use the moment-generating-function
technique and see Example 7.

62 Let $X_1$ and $X_2$ be independent standard normal random variables. Let $U$ be
independent of $X_1$ and $X_2$, and assume that $U$ is uniformly distributed over (0, 1).
Define $Z = UX_1 + (1 - U)X_2$.

(a) Find the conditional distribution of $Z$ given $U = u$.

(b) Find $\mathcal{E}[Z]$ and var $[Z]$.

*(c) Find the distribution of $Z$.



VI 

SAMPLING AND SAMPLING DISTRIBUTIONS 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concept of sampling and to present 
some distribution theoretical results that are engendered by sampling. It is a 
connecting chapter—it merges the distribution theory of the first five chapters 
into the statistical theory of the last five chapters. The intent is to present 
here in one location some of the laborious derivations of distributions that are 
associated with sampling and that will be necessary in our future study of the 
theory of statistics, especially estimation and testing hypotheses. Our thinking 
is that by deriving these results now, our later presentation of the statistical 
theory will not have to be interrupted by their derivations. The nature of the 
material to be given here is such that it is not easily motivated. 

Section 2 begins with a discussion of populations and samples. It ends 
with the definitions of statistic and of sample moments. Sample moments are 
important and useful statistics. Section 3 is devoted to the consideration of 
various results associated with the sample mean. The law of large numbers 
and the central-limit theorem are given, and then the exact distribution of the 
sample means from several of the different parametric families of distributions
introduced in Chap. III is given. Sampling from the normal distribution is considered
in Sec. 4, where the chi-square, F, and t distributions are defined. Order
statistics are discussed in the final section; they, like sample moments, are important
and useful statistics.


2 SAMPLING 

2.1 Inductive Inference 

Up to now we have been concerned with certain aspects of the theory of probability,
including distribution theory. Now the subject of sampling brings us
to the theory of statistics proper, and here we shall consider briefly one important
area of the theory of statistics and its relation to sampling.

Progress in science is often ascribed to experimentation. The research 
worker performs an experiment and obtains some data. On the basis of the 
data, certain conclusions are drawn. The conclusions usually go beyond the 
materials and operations of the particular experiment. In other words, the 
scientist may generalize from a particular experiment to the class of all similar 
experiments. This sort of extension from the particular to the general is called 
inductive inference. It is one way in which new knowledge is found. 

Inductive inference is well known to be a hazardous process. In fact, it 
is a theorem of logic that in inductive inference uncertainty is present. One 
simply cannot make absolutely certain generalizations. However, uncertain 
inferences can be made, and the degree of uncertainty can be measured if the 
experiment has been performed in accordance with certain principles. One
function of statistics is the provision of techniques for making inductive inferences
and for measuring the degree of uncertainty of such inferences. Uncertainty
is measured in terms of probability, and that is the reason we have
devoted so much time to the theory of probability.

Before proceeding further we shall say a few words about another kind of
inference: deductive inference. While conclusions which are reached by inductive
inference are only probable, those reached by deductive inference are conclusive.
To illustrate deductive inference, consider the following two statements:

(i) One of the interior angles of each right triangle equals 90°. 

(ii) Triangle A is a right triangle. 

If we accept these two statements, then we are forced to the conclusion: 

(iii) One of the angles of triangle A equals 90°.

This is an example of deductive inference, which can be described as a
method of deriving information [statement (iii)] from accepted facts [statements
(i) and (ii)]. Statement (i) is called the major premise, statement (ii) the minor
premise, and statement (iii) the conclusion. For another example, consider the
following:

(i) Major premise: All West Point graduates are over 18 years of age. 

(ii) Minor premise: John is a West Point graduate. 

(iii) Conclusion: John is over 18 years of age. 

The set of West Point graduates is a subset of the set of all persons over 18 years
old, and John is an element in the subset of West Point graduates; hence John is
also an element in the set of persons who are over 18 years old.

While deductive inference is extremely important, much of the new knowledge
in the real world comes about by the process of inductive inference. In the
science of mathematics, for example, deductive inference is used to prove theorems,
while in the empirical sciences inductive inference is used to find new
knowledge.

Let us illustrate inductive inference by a simple example. Suppose that 
we have a storage bin which contains (let us say) 10 million flower seeds which 
we know will each produce either white or red flowers. The information which 
we want is: How many (or what percent) of these 10 million seeds will produce 
white flowers? Now the only way in which we can be sure that this question 
is answered correctly is to plant every seed and observe the number producing 
white flowers. However, this is not feasible since we want to sell the seeds. 
Even if we did not want to sell the seeds, we would prefer to obtain an answer 
without expending so much effort. Of course, without planting each seed and 
observing the color of flower that each produces we cannot be certain of the 
number of seeds producing white flowers. Another thought which occurs is: 
Can we plant a few of the seeds and, on the basis of the colors of these few 
flowers, make a statement as to how many of the 10 million seeds will produce
white flowers? The answer is that we cannot make an exact prediction as to
how many white flowers the seeds will produce but we can make a probabilistic 
statement if we select the few seeds in a certain fashion. This is inductive 
inference. We select a few of the 10 million seeds, plant them, observe the 
number which produce white flowers, and on the basis of these few we make a 
prediction as to how many of the 10 million will produce white flowers; from a 
knowledge of the color of a few we generalize to the whole 10 million. We 
cannot be certain of our answer, but we can have confidence in it in a frequency- 
ratio-probability sense. 
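
The seed example can be mimicked numerically. What follows is only a sketch under stated assumptions: the true number of white-flowering seeds (3 million), the sample size (2,000), and the function name `estimate_white_fraction` are all invented for illustration and do not come from the text. Seeds are drawn independently and with equal probability, so a seed could in principle be drawn more than once.

```python
import random


def estimate_white_fraction(population_size, true_white, sample_size, seed=0):
    """Estimate the fraction of white-flowering seeds from a small random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Seeds are labelled 0 .. population_size - 1; by assumption the labels
    # below true_white stand for the seeds that produce white flowers.
    picks = [rng.randrange(population_size) for _ in range(sample_size)]
    whites = sum(1 for label in picks if label < true_white)
    return whites / sample_size


# Hypothetical bin: 10 million seeds, 3 million of which are white-flowering.
estimate = estimate_white_fraction(10_000_000, 3_000_000, 2_000)
print(estimate)
```

The printed value is an estimate of the assumed true fraction 0.3 obtained from only 2,000 of the 10 million seeds. It is not exact, which is the point of the passage: the generalization is uncertain, but its uncertainty can be described in probability terms.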

2.2 Populations and Samples 

We have seen in the previous subsection that a central problem in discovering 
new knowledge in the real world consists of observing a few of the elements 
under discussion, and on the basis of these few we make a statement about the 
totality of elements. We shall now investigate this procedure in more detail. 

Definition 1 Target population The totality of elements which are under 
discussion and about which information is desired will be called the target 
population. //// 

In the example in the previous subsection the 10 million seeds in the storage
bin form the target population. The target population may be all the dairy
cattle in Wisconsin on a certain date, the prices of bread in New York City on a 
certain date, the hypothetical sequence of heads and tails obtained by tossing a 
certain coin an infinite number of times, the hypothetical set of an infinite 
number of measurements of the velocity of light, and so forth. The important 
thing is that the target population must be capable of being quite well defined; 
it may be real or hypothetical. 

The problem of inductive inference is regarded as follows from the point 
of view of statistics: The object of an investigation is to find out something about 
a certain target population. It is generally impossible or impractical to examine 
the entire population, but one may examine a part of it (a sample from it) and, 
on the basis of this limited investigation, make inferences regarding the entire 
target population. 

The problem immediately arises as to how the sample of the population
should be selected. We stated in the previous section that we could make
probabilistic statements about the population if the sample is selected in a certain
fashion. Of particular importance is the case of a simple random sample,
usually called a random sample, which is defined in Definition 2 below for any
population which has a density. That is, we assume that each element in our
population has some numerical value associated with it and the distribution of
these numerical values is given by a density. For such a population we define
a random sample.

Definition 2 Random sample Let the random variables $X_1, X_2, \ldots, X_n$
have a joint density $f_{X_1, \ldots, X_n}(\cdot, \ldots, \cdot)$ that factors as follows:

$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n),$$

where $f(\cdot)$ is the (common) density of each $X_i$. Then $X_1, X_2, \ldots, X_n$ is
defined to be a random sample of size $n$ from a population with density
$f(\cdot)$. ////


In the example in the previous subsection the 10 million seeds in the storage
bin formed the population from which we propose to sample. Each seed is
an element of the population and will produce a white or red flower; so, strictly
speaking, there is not a numerical value associated with each element of the
population. However, if we, say, associate the number 1 with white and the
number 0 with red, then there is a numerical value associated with each element
of the population, and we can discuss whether or not a particular sample is
random. The random variable $X_i$ is then 1 or 0 depending on whether the $i$th
seed sampled produces a white or red flower, $i = 1, \ldots, n$. Now if the sampling
of seeds is performed in such a way that the random variables $X_1, \ldots, X_n$ are
independent and have the same density, then, according to Definition 2, the
sample is called random.

An important part of the definition of a random sample is the meaning of
the random variables $X_1, \ldots, X_n$. The random variable $X_i$ is a representation
for the numerical value that the $i$th item (or element) sampled will assume. After
the sample is observed, the actual values of $X_1, \ldots, X_n$ are known, and as usual,
we denote these observed values by $x_1, \ldots, x_n$. Sometimes the observations
$x_1, \ldots, x_n$ are called a random sample if $x_1, \ldots, x_n$ are the values of $X_1, \ldots, X_n$,
where $X_1, \ldots, X_n$ is a random sample.

Often it is not possible to select a random sample from the target population,
but a random sample can be selected from some related population. To
distinguish the two populations, we define sampled population.

Definition 3 Sampled population Let $X_1, X_2, \ldots, X_n$ be a random
sample from a population with density $f(\cdot)$; then this population is called
the sampled population. ////

Valid probability statements can be made about sampled populations on
the basis of random samples, but statements about the target populations are
not valid in a relative-frequency-probability sense unless the target population
is also the sampled population. We shall give some examples to bring out the
distinction between the sampled population and the target population.

EXAMPLE 1 Suppose that a sociologist desires to study the religious habits 
of 20-year-old males in the United States. He draws a sample from the 
20-year-old males of a large city to make his study. In this case the target 
population is the 20-year-old males in the United States, and the sampled 
population is the 20-year-old males in the city which he sampled. He can 
draw valid relative-frequency-probabilistic conclusions about his sampled 
population, but he must use his personal judgment to extrapolate to the 
target population, and the reliability of the extrapolation cannot be 
measured in relative-frequency-probability terms. //// 

EXAMPLE 2 A wheat researcher is studying the yield of a certain variety of
wheat in the state of Colorado. He has at his disposal five farms scattered
throughout the state on which he can plant the wheat and observe
the yield. The sampled population consists of the yields on these five
farms, whereas the target population consists of the yields of wheat on
every farm in the state. ////

This book will be concerned with the problem of selecting (drawing) a
sample from a sampled population with density $f(\cdot)$, and on the basis of these
sample observations probability statements will be made about $f(\cdot)$, or inferences
about $f(\cdot)$ will be made.

Remark We shall sometimes use the statement “population $f(\cdot)$” to
mean “a population with density $f(\cdot)$.” When we use the word “population”
without an adjective “sampled” or “target,” we shall always mean
sampled population. ////

2.3 Distribution of Sample 

Definition 4 Distribution of sample Let $X_1, X_2, \ldots, X_n$ denote a sample
of size $n$. The distribution of the sample $X_1, \ldots, X_n$ is defined to be the
joint distribution of $X_1, \ldots, X_n$. ////

Suppose that a random variable $X$ has a density $f(\cdot)$ in some population,
and suppose a sample of two values of $X$, say $x_1$ and $x_2$, is drawn at random.
$x_1$ is called the first observation, and $x_2$ the second observation. The pair of
numbers $(x_1, x_2)$ determines a point in a plane, and the collection of all such
pairs of numbers that might have been drawn forms a bivariate population.
We are interested in the distribution (bivariate) of this bivariate population in
terms of the original density $f(\cdot)$. The pair of numbers $(x_1, x_2)$ is a value of the
joint random variable $(X_1, X_2)$, and $X_1, X_2$ is a random sample (of size 2)
from $f(\cdot)$. By definition of random sample, the joint distribution of $X_1$ and
$X_2$, which we call the distribution of our random sample of size 2, is given by
$f_{X_1,X_2}(x_1, x_2) = f(x_1)f(x_2)$.

As a simple example, suppose that $X$ can have only two values, 0 and 1,
with probabilities $q = 1 - p$ and $p$, respectively. That is, $X$ is a discrete random
variable which has the Bernoulli distribution

$$f(x) = p^x q^{1-x} I_{\{0,1\}}(x). \tag{1}$$

The joint density for a random sample of two values from $f(\cdot)$ is

$$f_{X_1,X_2}(x_1, x_2) = f(x_1)f(x_2) = p^{x_1+x_2} q^{2-x_1-x_2} I_{\{0,1\}}(x_1) I_{\{0,1\}}(x_2). \tag{2}$$

It is to be observed that this (bivariate) density is not what we obtain as the
distribution of the number of successes, say $Y$, in drawing two elements from a
Bernoulli population. The density of $Y$ is given by

$$f_Y(y) = \binom{2}{y} p^y q^{2-y} \quad \text{for } y = 0, 1, 2.$$

The single random variable $Y$ equals $X_1 + X_2$.

It should be noted that $f_{X_1,X_2}(x_1, x_2)$ gives us the distribution of the sample
in the order drawn. For instance, in Eq. (2), $f_{X_1,X_2}(0, 1) = pq$ refers to the
probability of drawing first a 0 and then a 1.
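
The difference between the joint density of the ordered sample and the density of the number of successes can be checked by direct enumeration. The sketch below assumes a particular value $p = 1/3$, chosen only for illustration; the names `joint`, `dist_y`, and `binomial` are likewise not from the text. Exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb

p = Fraction(1, 3)  # hypothetical success probability, chosen for illustration
q = 1 - p


def f(x):
    """Bernoulli density p^x q^(1-x) on {0, 1}."""
    return p**x * q**(1 - x)


# Joint density of the ordered sample (x1, x2): the product f(x1) f(x2).
joint = {(x1, x2): f(x1) * f(x2) for x1, x2 in product((0, 1), repeat=2)}

# Density of Y = X1 + X2, obtained by summing the joint density over the
# ordered pairs that give the same total.
dist_y = {}
for (x1, x2), prob in joint.items():
    dist_y[x1 + x2] = dist_y.get(x1 + x2, 0) + prob

# Binomial density with two trials, for comparison with dist_y.
binomial = {y: comb(2, y) * p**y * q**(2 - y) for y in range(3)}

print(joint[(0, 1)], dist_y[1], binomial[1])
```

Here `joint[(0, 1)]` is the probability $pq$ of drawing first a 0 and then a 1, whereas `dist_y[1]` is $2pq$, because the event $Y = 1$ collects both orderings (0, 1) and (1, 0).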

Our comments for a random sample of size 2 generalize to a random sample
of size $n$, and we have the following remark.

Remark If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $f(\cdot)$, then
the distribution of the random sample $X_1, \ldots, X_n$, defined as the joint
distribution of $X_1, \ldots, X_n$, is given by $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n)$.
////

Note that again this gives the distribution of the sample in the order drawn.
Also, note that if $X_1, \ldots, X_n$ is a random sample, then $X_1, \ldots, X_n$ are
stochastically independent.

We might further note that our definition of random sampling has automatically
ruled out sampling from a finite population without replacement since,
then, the results of the drawings are not independent.
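
The lack of independence under sampling without replacement can be seen in a very small case. The following sketch assumes a hypothetical population of five elements, two labelled 1 and three labelled 0, and enumerates every equally likely ordered pair of draws; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

population = [1, 1, 0, 0, 0]  # hypothetical finite population

# Every ordered pair of distinct positions is an equally likely outcome of
# drawing two elements without replacement.
draws = list(permutations(range(len(population)), 2))


def prob(event):
    """Probability of an event, computed by counting equally likely draws."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))


p_first = prob(lambda d: population[d[0]] == 1)
p_second = prob(lambda d: population[d[1]] == 1)
p_both = prob(lambda d: population[d[0]] == 1 and population[d[1]] == 1)
p_second_given_first = p_both / p_first

print(p_second, p_second_given_first)
```

Each draw taken by itself gives a 1 with probability 2/5, but given that the first draw was a 1 the second is a 1 with probability only 1/4. The joint probability therefore does not factor into the product of the marginals, so the two draws do not form a random sample in the sense of Definition 2.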

2.4 Statistic and Sample Moments

One of the central problems in statistics is the following: It is desired to study a
population which has a density $f(\cdot; \theta)$, where the form of the density is known
but it contains an unknown parameter $\theta$ (if $\theta$ is known, then the density function
is completely specified). The procedure is to take a random sample
$X_1, X_2, \ldots, X_n$ of size $n$ from this density and let the value of some function,
say $t(x_1, x_2, \ldots, x_n)$, represent or estimate the unknown parameter $\theta$. The
problem is to determine which function will be the best one to estimate $\theta$. This
problem will be formulated in more detail in the next chapter. In this section
we shall examine certain functions, namely, the sample moments, of a random
sample. First, however, we shall define what we shall mean by a statistic.

Definition 5 Statistic A statistic is a function of observable random 
variables, which is itself an observable random variable, which does not 
contain any unknown parameters. //// 

The qualification imposed by the word “observable” is required because
of the way we intend to use a statistic. (“Observable” means that we can observe
the values of the random variables.) We intend to use a statistic to make inferences
about the density of the random variables, and if the random variables
were not observable, they would be of no use in making inferences.

For example, if the observable random variable $X$ has the density $\phi_{\mu,\sigma^2}(x)$,
where $\mu$ and $\sigma^2$ are unknown, then $X - \mu$ is not a statistic; neither is $X/\sigma$ (since
they are not functions of the observable random variable $X$ only; they contain
unknown parameters), but $X$, $X + 3$, and $X^2 + \log X^2$ are statistics.

In the formulation above, one of the central problems in statistics is to
find a suitable statistic (function of the random variables $X_1, X_2, \ldots, X_n$) to
represent $\theta$.


EXAMPLE 3 If $X_1, \ldots, X_n$ is a random sample from a density $f(\cdot; \theta)$, then
(provided $X_1, \ldots, X_n$ are observable)

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

is a statistic, and

$$\tfrac{1}{2}\{\min[X_1, \ldots, X_n] + \max[X_1, \ldots, X_n]\}$$

is also a statistic. If $f(x; \theta) = \phi_{\theta,1}(x)$ and $\theta$ is unknown, $\bar{X}_n - \theta$ is not a
statistic since it depends on $\theta$, which is unknown. ////
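
A statistic is something that can be computed from the observed values alone. As a rough illustration, the two statistics of Example 3 can be written as functions whose only input is the list of observations; the function names and the data below are made up for this sketch.

```python
def sample_mean(xs):
    """Sample mean: the sum of the observations divided by their number."""
    return sum(xs) / len(xs)


def midrange(xs):
    """Half the sum of the smallest and the largest observation."""
    return (min(xs) + max(xs)) / 2


observations = [2.0, 3.5, 1.5, 5.0]  # hypothetical observed values x_1, ..., x_4

print(sample_mean(observations), midrange(observations))
```

Neither function takes $\theta$ as an argument. A quantity such as $\bar{X}_n - \theta$ could not be coded this way when $\theta$ is unknown, since its value would require an input that is not observable.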

Next we shall define and discuss some important statistics, the sample
moments.

Definition 6 Sample moments Let $X_1, X_2, \ldots, X_n$ be a random sample
from the density $f(\cdot)$. Then the $r$th sample moment about 0, denoted by
$M'_r$, is defined to be

$$M'_r = \frac{1}{n} \sum_{i=1}^{n} X_i^r. \tag{3}$$

In particular, if $r = 1$, we get the sample mean, which is usually denoted by
$\bar{X}$ or $\bar{X}_n$; that is,

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{4}$$

Also, the $r$th sample moment about $\bar{X}_n$, denoted by $M_r$, is defined to be

$$M_r = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^r. \tag{5}$$

////

Remark Note that sample moments are examples of statistics. ////
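
Equations (3) and (5) translate directly into code. The sketch below is one possible rendering; the names `raw_moment` and `central_moment` and the data are assumptions made for illustration only.

```python
def raw_moment(xs, r):
    """r-th sample moment about 0, as in Eq. (3)."""
    return sum(x**r for x in xs) / len(xs)


def central_moment(xs, r):
    """r-th sample moment about the sample mean, as in Eq. (5)."""
    mean = raw_moment(xs, 1)  # the sample mean, Eq. (4)
    return sum((x - mean)**r for x in xs) / len(xs)


xs = [1, 2, 3, 6]  # hypothetical observations

print(raw_moment(xs, 1), raw_moment(xs, 2), central_moment(xs, 2))
```

For these values the sample mean is 3, the second sample moment about 0 is 12.5, and the second sample moment about the mean is 3.5, in agreement with the algebraic identity $M_2 = M'_2 - \bar{X}_n^2$.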

We will consider in detail some properties of the sample mean in Sec. 3 
below. 

In Chap. II we defined the $r$th moment of a random variable $X$, or the $r$th
moment of its corresponding density $f_X(\cdot)$, to be $\mathcal{E}[X^r] = \mu'_r$. We could say
that $\mathcal{E}[X^r]$ is the $r$th population moment of the population with density
$f(x) = f_X(x)$. We shall now show that the sample moments reflect the population
moments in the sense that the expected value of a sample moment (about 0)
equals the corresponding population moment. Also, the variance of a sample
moment will be shown to be $(1/n)$ times some function of population moments.
The implication is that for a given population the values that the sample moment
assumes will tend to be more concentrated about the corresponding population
moment for large sample size $n$ than for small sample size. Thus a sample
moment can be used to estimate its corresponding population moment (provided
the population moment exists).

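Theorem 1 Let $X_1, X_2, \ldots, X_n$ be a random sample from a population
with a density $f(\cdot)$. The expected value of the rth sample moment (about
0) is equal to the rth population moment; that is,

$$\mathcal{E}[M'_r] = \mu'_r \quad (\text{if } \mu'_r \text{ exists}). \tag{6}$$

Also,

$$\operatorname{var}[M'_r] = \frac{1}{n}\{\mathcal{E}[X^{2r}] - (\mathcal{E}[X^r])^2\} = \frac{1}{n}[\mu'_{2r} - (\mu'_r)^2] \quad (\text{if } \mu'_{2r} \text{ exists}). \tag{7}$$

PROOF

$$\mathcal{E}[M'_r] = \mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i^r\right] = \frac{1}{n}\,\mathcal{E}\left[\sum_{i=1}^{n} X_i^r\right] = \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}[X_i^r] = \frac{1}{n}\sum_{i=1}^{n}\mu'_r = \mu'_r.$$

$$\operatorname{var}[M'_r] = \operatorname{var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i^r\right] = \left(\frac{1}{n}\right)^2 \operatorname{var}\left[\sum_{i=1}^{n} X_i^r\right] = \left(\frac{1}{n}\right)^2 \sum_{i=1}^{n} \operatorname{var}[X_i^r]$$

$$= \left(\frac{1}{n}\right)^2 \sum_{i=1}^{n} \{\mathcal{E}[X_i^{2r}] - (\mathcal{E}[X_i^r])^2\} = \frac{1}{n}\{\mathcal{E}[X^{2r}] - (\mathcal{E}[X^r])^2\}$$

$$= \frac{1}{n}[\mu'_{2r} - (\mu'_r)^2].$$

In particular, if $r = 1$, we get the following corollary. ////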

Corollary Let $X_1, X_2, \ldots, X_n$ be a random sample from a density
$f(\cdot)$, and let $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample
mean; then

$$\mathcal{E}[\bar X_n] = \mu \quad \text{and} \quad \operatorname{var}[\bar X_n] = \frac{1}{n}\sigma^2, \tag{8}$$

where $\mu$ and $\sigma^2$ are, respectively, the mean and variance of
$f(\cdot)$. ////
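
Theorem 1 can be checked exactly on a small discrete population by enumerating
every possible sample. The sketch below assumes, purely for illustration, that
the population is one roll of a fair die; the helper names are hypothetical.
Exact rational arithmetic is used so that the comparisons are equalities rather
than approximations.

```python
from fractions import Fraction
from itertools import product

FACES = [1, 2, 3, 4, 5, 6]  # assumed population: one roll of a fair die


def population_moment(k):
    """k-th population moment about 0 for the fair-die population."""
    return Fraction(sum(x ** k for x in FACES), len(FACES))


def sample_moment_mean_and_variance(n, r):
    """Exact mean and variance of M'_r over all equally likely samples of size n."""
    values = [Fraction(sum(x ** r for x in s), n)
              for s in product(FACES, repeat=n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


n, r = 3, 2
mean, variance = sample_moment_mean_and_variance(n, r)
# Eq. (6): the expected value of M'_r equals the r-th population moment.
print(mean == population_moment(r))
# Eq. (7): the variance of M'_r is (mu'_{2r} - (mu'_r)^2) / n.
print(variance == (population_moment(2 * r) - population_moment(r) ** 2) / n)
```

Both comparisons should print True. Taking r = 1 gives the corollary: the mean
of the sample mean is 7/2 and its variance is (35/12)/n.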

As we mentioned earlier, properties of the sample mean will be studied 
in detail in the next section. 

Theorem 1 gives the mean and variance in terms of population moments of
the rth sample moment about 0; a similar, though more complicated, result
can be derived for the mean and variance of the rth sample moment about the
sample mean. We will be content with looking only at the particular case
$r = 2$, that is, at $M_2 = (1/n)\sum_{i=1}^{n} (X_i - \bar X)^2$. $M_2$ is
sometimes called the sample variance, although we will take Definition 7 as our
definition of the sample variance.
Definition 7 Sample variance Let $X_1, X_2, \ldots, X_n$ be a random sample
from a density $f(\cdot)$; then

$$S_n^2 = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2 \quad \text{for } n > 1 \tag{9}$$

is defined to be the sample variance. ////


The reason for taking $S^2$ rather than $M_2$ as our definition of the sample
variance (both measure dispersion in the sample) is that the expected value of
$S^2$ equals the population variance.

The proof of the following remark is left as an exercise. 

Remark $S_n^2 = S^2 = \dfrac{1}{2n(n-1)} \sum_{i=1}^{n}\sum_{j=1}^{n} (X_i - X_j)^2$. ////

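Theorem 2 Let $X_1, X_2, \ldots, X_n$ be a random sample from a density
$f(\cdot)$, and let

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2.$$

Then

$$\mathcal{E}[S^2] = \sigma^2 \quad \text{and} \quad \operatorname{var}[S^2] = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right) \quad \text{for } n > 1. \tag{10}$$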

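proof (Only the first part will be proved.) Recall that
$\sigma^2 = \mathcal{E}[(X - \mu)^2]$ and $\mu_r = \mathcal{E}[(X - \mu)^r]$.
We commence by noting and proving an identity that is quite useful.

$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} (X_i - \bar X)^2 + n(\bar X - \mu)^2 \tag{11}$$

since

$$\sum (X_i - \mu)^2 = \sum (X_i - \bar X + \bar X - \mu)^2 = \sum [(X_i - \bar X) + (\bar X - \mu)]^2$$

$$= \sum [(X_i - \bar X)^2 + 2(X_i - \bar X)(\bar X - \mu) + (\bar X - \mu)^2]$$

$$= \sum (X_i - \bar X)^2 + 2(\bar X - \mu)\sum (X_i - \bar X) + n(\bar X - \mu)^2$$

$$= \sum (X_i - \bar X)^2 + n(\bar X - \mu)^2.$$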
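Using the identity of Eq. (11), we obtain

$$\mathcal{E}[S^2] = \mathcal{E}\left[\frac{1}{n-1}\sum (X_i - \bar X)^2\right]$$

$$= \frac{1}{n-1}\,\mathcal{E}\left[\sum (X_i - \mu)^2 - n(\bar X - \mu)^2\right]$$

$$= \frac{1}{n-1}\left\{\sum \mathcal{E}[(X_i - \mu)^2] - n\,\mathcal{E}[(\bar X - \mu)^2]\right\}$$

$$= \frac{1}{n-1}\left\{n\sigma^2 - n\operatorname{var}[\bar X]\right\} = \frac{1}{n-1}\left(n\sigma^2 - n\,\frac{\sigma^2}{n}\right) = \sigma^2.$$

Although the derivation of the formula for the variance of $S^2$ can
be accomplished by utilizing the above identity [Eq. (11)] and

$$\bar X - \mu = \frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\,n\mu = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu),$$

such derivation is lengthy and is omitted here only to be relegated to
the Problems. ////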
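
Since the variance formula in Eq. (10) is stated without proof, an exact check
on a small population may be reassuring. The sketch below again assumes a
fair-die population (an illustrative choice, with hypothetical helper names),
enumerates all samples of size n = 4, and compares the exact mean and variance
of $S^2$ with Theorem 2. It also shows that $M_2$, which divides by n instead
of n - 1, falls short of $\sigma^2$ on average.

```python
from fractions import Fraction
from itertools import product

FACES = [1, 2, 3, 4, 5, 6]  # assumed population: one roll of a fair die
MU = Fraction(sum(FACES), len(FACES))


def central_moment(k):
    """k-th population moment about the mean."""
    return sum((x - MU) ** k for x in FACES) / len(FACES)


def sum_of_squared_deviations(sample):
    """Sum of (x_i - xbar)^2 over the sample, in exact arithmetic."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


def exact_mean_and_variance(values):
    """Mean and variance of a list of equally likely values."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


n = 4
samples = list(product(FACES, repeat=n))
s2_values = [sum_of_squared_deviations(s) / (n - 1) for s in samples]
m2_values = [sum_of_squared_deviations(s) / n for s in samples]

sigma2, mu4 = central_moment(2), central_moment(4)
mean_s2, var_s2 = exact_mean_and_variance(s2_values)
mean_m2, _ = exact_mean_and_variance(m2_values)

print(mean_s2 == sigma2)                       # S^2 is unbiased for sigma^2
print(mean_m2 == Fraction(n - 1, n) * sigma2)  # M_2 is too small on average
print(var_s2 == (mu4 - Fraction(n - 3, n - 1) * sigma2 ** 2) / n)  # Eq. (10)
```

All three comparisons should print True.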

Sample moments are examples of statistics that can be used to estimate
their population counterparts; for example, $M'_r$ estimates $\mu'_r$,
$\bar X$ estimates $\mu$, and $S^2$ estimates $\sigma^2$. In each case, we are
taking some function of the sample, which we can observe, and using the value
of that function of the sample to estimate the unknown population parameter.


3 SAMPLE MEAN 

The first sample moment is the sample mean defined to be

$$\bar X = \bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

where $X_1, X_2, \ldots, X_n$ is a random sample from a density $f(\cdot)$.
$\bar X$ is a function of the random variables $X_1, \ldots, X_n$, and hence
theoretically the distribution of $\bar X$ can be found. In general, one would
suspect that the distribution of $\bar X$ depends on the density $f(\cdot)$
from which the random sample was selected, and indeed it does. Two
characteristics of the distribution of $\bar X$, its mean and variance, do not
depend on the density $f(\cdot)$ per se but depend only on two characteristics
of the density $f(\cdot)$. This idea is reviewed in the following subsection,
while succeeding subsections consider other results involving the sample mean.
The exact distribution of $\bar X$ will be given for certain specific densities
$f(\cdot)$.

It might be helpful in reading this section to think of the sample mean
$\bar X$ as an estimate of the mean $\mu$ of the density $f(\cdot)$ from which
the sample was selected. We might think that one purpose in taking the sample
is to estimate $\mu$ with $\bar X$.


3.1 Mean and Variance 

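Theorem 3 Let $X_1, X_2, \ldots, X_n$ be a random sample from a density
$f(\cdot)$, which has mean $\mu$ and finite variance $\sigma^2$, and let
$\bar X = (1/n)\sum_{i=1}^{n} X_i$. Then

$$\mathcal{E}[\bar X] = \mu_{\bar X} = \mu \quad \text{and} \quad \operatorname{var}[\bar X] = \sigma^2_{\bar X} = \frac{1}{n}\sigma^2. \tag{12}$$

////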

Theorem 3 is just a restatement of the corollary of Theorem 1. In light
of using a value of $\bar X$ to estimate $\mu$, let us note what Theorem 3
says. $\mathcal{E}[\bar X] = \mu$ says that on the average $\bar X$ is equal to
the parameter $\mu$ being estimated or that the distribution of $\bar X$ is
centered about $\mu$. $\operatorname{var}[\bar X] = (1/n)\sigma^2$ says that
the spread of the values of $\bar X$ about $\mu$ is small for a large sample
size as compared to a small sample size. For instance, the variance of the
distribution of $\bar X$ for a sample of size 20 is one-half the variance of
the distribution of $\bar X$ for a sample of size 10. So for a large sample
size the values of $\bar X$ (which are used to estimate $\mu$) tend to be more
concentrated about $\mu$ than for a small sample size. This notion is further
exemplified by the law of large numbers considered in the next subsection.


3.2 Law of Large Numbers 

Let $f(\cdot\,; \theta)$ be the density of a random variable $X$. We have
discussed the fact that one way to get some information about the density
function $f(\cdot\,; \theta)$ is to observe a random sample and make an
inference from the sample to the population. If $\theta$ were known, the
density function would be completely specified, and no inference from the
sample to the population would be necessary. Therefore, it seems that we would
like to have the random sample tell us something about the unknown parameter
$\theta$. This problem will be discussed in detail in the next chapter. In
this subsection we shall discuss a related particular problem.

Let $\mathcal{E}[X]$ be denoted by $\mu$ in the density $f(\cdot)$. The
problem is to estimate $\mu$. In a loose sense, $\mathcal{E}[X]$ is the average
of an infinite number of values of the random variable $X$. In any real-world
problem we can observe only a finite number of values of the random variable
$X$. A very crucial question then is: Using only a finite number of values of
$X$ (a random sample of size n, say), can any reliable inferences be made about
$\mathcal{E}[X]$, the average of an infinite number of values of $X$? The
answer is “yes”; reliable inferences about $\mathcal{E}[X]$ can be made by
using only a finite sample, and we shall demonstrate this by proving what is
called the weak law of large numbers. In words, the law states the following: A
positive integer n can be determined such that if a random sample of size n or
larger is taken from a population with the density $f(\cdot)$ (with
$\mathcal{E}[X] = \mu$), the probability can be made to be as close to 1 as
desired that the sample mean $\bar X$ will deviate from $\mu$ by less than any
arbitrarily specified small quantity. More precisely, the weak law of large
numbers states that for any two chosen small numbers $\varepsilon$ and
$\delta$, where $\varepsilon > 0$ and $0 < \delta < 1$, there exists an integer
n such that if a random sample of size n or larger is obtained from $f(\cdot)$
and the sample mean, denoted by $\bar X_n$, computed, then the probability is
greater than $1 - \delta$ (i.e., as close to 1 as desired) that $\bar X_n$
deviates from $\mu$ by less than $\varepsilon$ (i.e., is arbitrarily close to
$\mu$). In symbols this is written: For any $\varepsilon > 0$ and
$0 < \delta < 1$ there exists an integer n such that for all integers
$m \ge n$

$$P[\,|\bar X_m - \mu| < \varepsilon\,] \ge 1 - \delta.$$

The weak law of large numbers is proved using the Chebyshev inequality given
in Chap. II.

Theorem 4 Weak law of large numbers Let $f(\cdot)$ be a density with mean
$\mu$ and finite variance $\sigma^2$, and let $\bar X_n$ be the sample mean of
a random sample of size n from $f(\cdot)$. Let $\varepsilon$ and $\delta$ be
any two specified numbers satisfying $\varepsilon > 0$ and $0 < \delta < 1$.
If n is any integer greater than $\sigma^2/\varepsilon^2\delta$, then

$$P[-\varepsilon < \bar X_n - \mu < \varepsilon] \ge 1 - \delta. \tag{13}$$

proof Theorem 5 in Subsec. 4.4 of Chap. II stated that
$P[g(X) \ge k] \le \mathcal{E}[g(X)]/k$ for every $k > 0$, random variable $X$,
and nonnegative function $g(\cdot)$. Equivalently,
$P[g(X) < k] \ge 1 - \mathcal{E}[g(X)]/k$. Let $g(X) = (\bar X_n - \mu)^2$ and
$k = \varepsilon^2$; then

$$P[-\varepsilon < \bar X_n - \mu < \varepsilon] = P[\,|\bar X_n - \mu| < \varepsilon\,] = P[\,|\bar X_n - \mu|^2 < \varepsilon^2\,]$$

$$\ge 1 - \frac{\mathcal{E}[(\bar X_n - \mu)^2]}{\varepsilon^2} = 1 - \frac{(1/n)\sigma^2}{\varepsilon^2} \ge 1 - \delta$$

for $\delta > \sigma^2/n\varepsilon^2$ or $n > \sigma^2/\varepsilon^2\delta$. ////

Below are two examples to illustrate how the weak law of large numbers 
can be used. 


EXAMPLE 4 Suppose that some distribution with an unknown mean has
variance equal to 1. How large a sample must be taken in order that the
probability will be at least .95 that the sample mean $\bar X_n$ will lie
within .5 of the population mean? We have $\sigma^2 = 1$, $\varepsilon = .5$,
and $\delta = .05$; therefore

$$n > \frac{\sigma^2}{\delta\varepsilon^2} = \frac{1}{.05(.5)^2} = 80.$$

////


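EXAMPLE 5 How large a sample must be taken in order that you are 99
percent certain that $\bar X_n$ is within $.5\sigma$ of $\mu$? We have
$\varepsilon = .5\sigma$ and $\delta = .01$. Thus

$$n > \frac{\sigma^2}{\delta\varepsilon^2} = \frac{\sigma^2}{.01(.5)^2\sigma^2} = \frac{1}{.01(.5)^2} = 400.$$

////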


We have shown that by use of a random sample inductive inferences to 
populations can be made and the reliability of the inferences can be measured in 
terms of probability. For instance, in Example 4 above, the probability that 
the sample mean will be within one-half unit of the unknown population mean 
is at least .95 if a sample of size greater than 80 is taken. 
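
The sample-size calculation of Examples 4 and 5 is a one-line function. The
sketch below is illustrative only; the function name and its argument checks
are assumptions, not part of the text. The values are computed in floating
point, so they may differ from 80 and 400 in the last decimal places.

```python
def chebyshev_sample_size_bound(sigma2, eps, delta):
    """Return sigma^2 / (eps^2 * delta), the threshold of Theorem 4.

    Any integer n strictly greater than the returned value guarantees
    P[|sample mean - mu| < eps] >= 1 - delta.
    """
    if sigma2 < 0 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need sigma2 >= 0, eps > 0 and 0 < delta < 1")
    return sigma2 / (eps ** 2 * delta)


# Example 4: sigma^2 = 1, eps = .5, delta = .05.
print(chebyshev_sample_size_bound(1.0, 0.5, 0.05))   # about 80
# Example 5: eps = .5 * sigma, so sigma cancels; take sigma = 1.
print(chebyshev_sample_size_bound(1.0, 0.5, 0.01))   # about 400
```

The bound rests only on the Chebyshev inequality, so it holds for every density
with finite variance; for a particular density a much smaller sample may
already suffice.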


3.3 Central-limit Theorem 

Although we have already stated the central-limit theorem in our study of
distribution theory in Chap. V, we will repeat it here in our study of the
sample mean $\bar X$ because it gives the asymptotic distribution of $\bar X$.
At the outset of this section we indicated that we were interested in the
distribution of $\bar X$. The central-limit theorem, which is one of the most
important theorems in all of probability and statistics, tells us approximately
how $\bar X$ is distributed.

Theorem 5 Central-limit theorem Let $f(\cdot)$ be a density with mean $\mu$
and finite variance $\sigma^2$. Let $\bar X_n$ be the sample mean of a random
sample of size n from $f(\cdot)$. Let the random variable $Z_n$ be defined by

$$Z_n = \frac{\bar X_n - \mathcal{E}[\bar X_n]}{\sqrt{\operatorname{var}[\bar X_n]}} = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}}. \tag{14}$$

Then, the distribution of $Z_n$ approaches the standard normal distribution
as n approaches infinity. ////

Theorem 5 tells us that the limiting distribution of $Z_n$ (which is
$\bar X_n$ standardized) is a standard normal distribution, or it tells us that
$\bar X_n$ itself is approximately, or asymptotically, distributed as a normal
distribution with mean $\mu$ and variance $\sigma^2/n$.

The astonishing thing about Theorem 5 is the fact that nothing is said 
about the form of the original density function. Whatever the distribution 
function, provided only that it has a finite variance, the sample mean will have 
approximately the normal distribution for large samples. The condition that 
the variance be finite is not a critical restriction so far as applied statistics is 
concerned because in almost any practical situation the range of the random 
variable will be finite, in which case the variance must necessarily be finite. 

The importance of Theorem 5, as far as practical applications are concerned,
is the fact that the mean $\bar X_n$ of a random sample from any distribution
with finite variance $\sigma^2$ and mean $\mu$ is approximately distributed as
a normal random variable with mean $\mu$ and variance $\sigma^2/n$.
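
A small simulation can make the approach to normality concrete. The sketch
below is an illustration under stated assumptions: the population is taken to
be the density $f(x) = e^{-x}$ for $x > 0$ (mean 1, variance 1), and the
function name, seed, and sample sizes are arbitrary choices. It estimates
$P[Z_n \le 0]$, which equals 1/2 for a standard normal variable. For n = 1 the
exact value is $1 - e^{-1}$, roughly .632, and the estimates should drift
toward .5 as n grows.

```python
import math
import random


def prob_mean_below_mu(n, reps, rng):
    """Estimate P[Z_n <= 0] for samples of size n from f(x) = exp(-x), x > 0.

    This density has mu = 1 and sigma = 1, so Z_n = (mean - 1) / (1 / sqrt(n))
    as in Eq. (14).
    """
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        z_n = (xbar - 1.0) / (1.0 / math.sqrt(n))
        hits += z_n <= 0
    return hits / reps


rng = random.Random(12345)  # fixed seed so the run is repeatable
for n in (1, 3, 10, 50):
    print(n, prob_mean_below_mu(n, 10_000, rng))
```

The slow drift toward .5 reflects the skewness of the exponential density; a
symmetric population would show essentially no discrepancy for this particular
event.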

We shall not be able to prove Theorem 5 because it requires rather advanced
mathematical techniques. However, in order to make the theorem
plausible, we shall outline a proof for the more restricted situation in which 
the distribution has a moment generating function. The argument will be 
essentially a matter of showing that the moment generating function for the 
sample mean approaches the moment generating function for the normal 
distribution. 

Recall that the moment generating function of a standard normal distribution
is given by $e^{\frac{1}{2}t^2}$. (See Subsec. 3.2 of Chap. III.) Let
$m(t) = e^{\frac{1}{2}t^2}$. Let $m_{Z_n}(t)$ denote the moment generating
function of $Z_n$. It is our purpose to show that $m_{Z_n}(t)$ must approach
$m(t)$ when n, the sample size, becomes large.
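Now

$$m_{Z_n}(t) = \mathcal{E}[e^{tZ_n}] = \mathcal{E}[\exp tZ_n] = \mathcal{E}\left[\exp\left(t\,\frac{\bar X_n - \mu}{\sigma/\sqrt{n}}\right)\right] = \mathcal{E}\left[\exp\left(\frac{t}{n}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma/\sqrt{n}}\right)\right]$$

$$= \mathcal{E}\left[\prod_{i=1}^{n}\exp\left(\frac{t}{\sqrt{n}}\,\frac{X_i - \mu}{\sigma}\right)\right] = \prod_{i=1}^{n}\mathcal{E}\left[\exp\left(\frac{t}{\sqrt{n}}\,\frac{X_i - \mu}{\sigma}\right)\right]$$

using the independence of $X_1, \ldots, X_n$. Now if we let
$Y_i = (X_i - \mu)/\sigma$, then $m_{Y_i}(t)$, the moment generating function
of $Y_i$, is independent of i since all $Y_i$ have the same distribution. Let
$m_Y(t)$ denote $m_{Y_i}(t)$; then

$$\prod_{i=1}^{n}\mathcal{E}\left[\exp\left(\frac{t}{\sqrt{n}}\,Y_i\right)\right] = \prod_{i=1}^{n} m_{Y_i}\left(\frac{t}{\sqrt{n}}\right) = \left[m_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n.$$

Hence,

$$m_{Z_n}(t) = \left[m_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n. \tag{15}$$

The rth derivative of $m_Y(t/\sqrt{n})$ evaluated at $t = 0$ gives us the rth
moment about the mean of the density $f(\cdot)$ divided by
$(\sigma\sqrt{n})^r$, so we may write

$$m_Y\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{\mu_1}{\sigma}\,\frac{t}{\sqrt{n}} + \frac{1}{2!}\,\frac{\mu_2}{\sigma^2}\left(\frac{t}{\sqrt{n}}\right)^2 + \frac{1}{3!}\,\frac{\mu_3}{\sigma^3}\left(\frac{t}{\sqrt{n}}\right)^3 + \cdots, \tag{16}$$

and since $\mu_1 = 0$ and $\mu_2 = \sigma^2$, this may be written

$$m_Y\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{1}{n}\left(\frac{1}{2}t^2 + \frac{1}{3!\sqrt{n}}\,\frac{\mu_3}{\sigma^3}\,t^3 + \frac{1}{4!\,n}\,\frac{\mu_4}{\sigma^4}\,t^4 + \cdots\right). \tag{17}$$

Now $\lim_{n\to\infty} (1 + u/n)^n = e^u$, where u represents the expression
within the parentheses in Eq. (17). We have
$\lim_{n\to\infty} m_{Z_n}(t) = e^{\frac{1}{2}t^2}$, so that in the limit,
$Z_n$ has the same moment generating function as a standard normal and, by a
theorem similar to Theorem 7 in Chap. II, has the same distribution.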

The degree of approximation depends, of course, on the sample size and
on the particular density $f(\cdot)$. The approach to normality is illustrated
in Fig. 2 for the particular function defined by
$f(x) = e^{-x} I_{(0,\infty)}(x)$. The solid curves give the actual
distributions, while the dashed curves give the normal approximations.
Figure 2a gives the original distribution which corresponds to samples of 1;
Fig. 2b shows the distribution of sample means for n = 3; Fig. 2c gives the
distribution of sample means for n = 10. The curves rather exaggerate the
approach to normality because they cannot show what happens on the tails of the
distribution. Ordinarily distributions of sample means approach normality
fairly rapidly with the sample size in the region of the mean, but more slowly
at points distant from the mean; usually the greater the distance of a point
from the mean, the more slowly the normal approximation approaches the actual
distribution.

FIGURE 2

In the following subsections we will give the exact distribution of the
sample mean for some specific densities $f(\cdot)$.

3.4 Bernoulli and Poisson Distributions 

If $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution,
we can find the exact distribution of $\bar X_n$. (We know that $\bar X_n$ is
approximately normally distributed.) The density from which we are sampling is

$$f(x) = p^x (1 - p)^{1-x} I_{\{0,1\}}(x).$$

We know (see Example 9 of Chap. V) that $\sum_{i=1}^{n} X_i$ has a binomial
distribution; that is, with $q = 1 - p$,

$$P\left[\sum_{i=1}^{n} X_i = k\right] = \binom{n}{k} p^k q^{n-k} I_{\{0,1,\ldots,n\}}(k);$$

hence, the distribution of $\bar X_n$ is given by

$$P\left[\bar X_n = \frac{k}{n}\right] = \binom{n}{k} p^k q^{n-k} \quad \text{for } k = 0, 1, \ldots, n. \tag{18}$$

So $\bar X_n$, the sample mean of a random sample from a Bernoulli density,
takes on the values $0, 1/n, 2/n, \ldots, 1$ with respective binomial
probabilities

$$\binom{n}{0} p^0 q^n,\ \binom{n}{1} p^1 q^{n-1},\ \binom{n}{2} p^2 q^{n-2},\ \ldots,\ \binom{n}{n} p^n q^0.$$

If $X_1, \ldots, X_n$ is a random sample from a Poisson distribution with mean
$\lambda$, then $\sum X_i$ also has a Poisson distribution with parameter
$n\lambda$ (see Example 10 of Chap. V), and hence

$$P\left[\bar X_n = \frac{k}{n}\right] = P\left[\sum_{i=1}^{n} X_i = k\right] = \frac{e^{-n\lambda}(n\lambda)^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots, \tag{19}$$

which gives the exact distribution of the sample mean for a sample from a
Poisson density.
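
The exact distribution in Eq. (18) is easy to tabulate. The following sketch
is illustrative; the function name and the choices n = 10 and p = .3 are
assumptions. It builds the distribution of the sample mean of a Bernoulli
sample and confirms numerically that the probabilities sum to 1 and that the
mean and variance agree with $p$ and $pq/n$, as the corollary of Theorem 1
requires.

```python
from math import comb


def bernoulli_mean_pmf(n, p):
    """Map each possible value k/n of the sample mean to its probability, Eq. (18)."""
    q = 1.0 - p
    return {k / n: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}


pmf = bernoulli_mean_pmf(n=10, p=0.3)  # n and p are illustrative choices
mean = sum(x * pr for x, pr in pmf.items())
variance = sum((x - mean) ** 2 * pr for x, pr in pmf.items())
# The three printed numbers should be close to 1, p = .3, and pq/n = .021.
print(sum(pmf.values()), mean, variance)
```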


3.5 Exponential Distribution 

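Let $X_1, X_2, \ldots, X_n$ be a random sample from the exponential density

$$f(x) = \theta e^{-\theta x} I_{(0,\infty)}(x).$$

According to Example 11 of Chap. V, $\sum_{i=1}^{n} X_i$ has a gamma
distribution with parameters n and $\theta$; that is,

$$f_{\sum X_i}(z) = \frac{1}{\Gamma(n)}\, z^{n-1}\theta^n e^{-\theta z} I_{(0,\infty)}(z),$$

or

$$P\left[\sum X_i \le y\right] = \int_0^y \frac{1}{\Gamma(n)}\, z^{n-1}\theta^n e^{-\theta z}\,dz \quad \text{for } y > 0,$$

and so

$$P\left[\bar X_n \le \frac{y}{n}\right] = \int_0^y \frac{1}{\Gamma(n)}\, z^{n-1}\theta^n e^{-\theta z}\,dz \quad \text{for } y > 0.$$

Or,

$$P[\bar X_n \le x] = \int_0^{nx} \frac{1}{\Gamma(n)}\, z^{n-1}\theta^n e^{-\theta z}\,dz = \int_0^x \frac{1}{\Gamma(n)}\,(nu)^{n-1}\theta^n e^{-n\theta u}\, n\,du;$$

that is, $\bar X_n$ has a gamma distribution with parameters n and $n\theta$.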
3.6 Uniform Distribution

Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution on the
interval (0, 1]. The exact density of $\bar X_n$ is given by

$$f_{\bar X_n}(x) = \sum_{k=0}^{n-1} \frac{n}{(n-1)!}\left[(nx)^{n-1} - \binom{n}{1}(nx - 1)^{n-1} + \binom{n}{2}(nx - 2)^{n-1} - \cdots + (-1)^k \binom{n}{k}(nx - k)^{n-1}\right] I_{(k/n,\,(k+1)/n]}(x). \tag{20}$$

The derivation of the above (using mathematical induction and the convolution
formula) is rather tedious and is omitted. Instead let us look at the
particular cases n = 1, 2, 3.

$$f_{\bar X_1}(x) = f_{X_1}(x) = I_{(0,1]}(x),$$

$$f_{\bar X_2}(x) = 2(2x) I_{(0,\frac{1}{2}]}(x) + 2[2x - 2(2x - 1)] I_{(\frac{1}{2},1]}(x) = \begin{cases} 4x & \text{for } 0 < x \le \frac{1}{2} \\ 4(1 - x) & \text{for } \frac{1}{2} < x \le 1, \end{cases}$$

and

$$f_{\bar X_3}(x) = \tfrac{3}{2}(3x)^2 I_{(0,\frac{1}{3}]}(x) + \tfrac{3}{2}\left[(3x)^2 - \binom{3}{1}(3x - 1)^2\right] I_{(\frac{1}{3},\frac{2}{3}]}(x) + \tfrac{3}{2}\left[(3x)^2 - \binom{3}{1}(3x - 1)^2 + \binom{3}{2}(3x - 2)^2\right] I_{(\frac{2}{3},1]}(x)$$

$$= \begin{cases} \frac{27}{2}x^2 & \text{for } 0 < x \le \frac{1}{3} \\ 27\left[\frac{1}{12} - (x - \frac{1}{2})^2\right] & \text{for } \frac{1}{3} < x \le \frac{2}{3} \\ \frac{27}{2}(1 - x)^2 & \text{for } \frac{2}{3} < x \le 1. \end{cases}$$

$f_{\bar X_1}(x)$, $f_{\bar X_2}(x)$, and $f_{\bar X_3}(x)$ are sketched in
Fig. 3, and an approach to normality can be observed. (In fact, the inflection
points of $f_{\bar X_3}(x)$ and of the normal approximation occur at the same
points!)

We have given the distribution of the sample mean from a uniform distribution
on the interval (0, 1]; the distribution of the sample mean from a uniform
distribution over an arbitrary interval (a, b] can be found by transformation.
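
Equation (20) can be evaluated numerically. The sketch below is an
illustration only; the function names and the midpoint-rule integrator are
assumptions. It implements the density of the mean of n uniform (0, 1]
variables, reproduces the special cases n = 2 and n = 3 at single points, and
checks that each density integrates to approximately 1.

```python
from math import ceil, comb, factorial


def uniform_mean_density(x, n):
    """Density of the mean of n independent uniform (0, 1] variables, Eq. (20)."""
    if not 0 < x <= 1:
        return 0.0
    k = ceil(n * x) - 1  # x lies in the interval (k/n, (k + 1)/n]
    alternating_sum = sum((-1) ** j * comb(n, j) * (n * x - j) ** (n - 1)
                          for j in range(k + 1))
    return n / factorial(n - 1) * alternating_sum


def integrate(f, a, b, steps=6000):
    """Midpoint-rule approximation to the integral of f over (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


print(uniform_mean_density(0.25, 2))  # n = 2 case: 4x at x = 1/4
print(uniform_mean_density(0.5, 3))   # n = 3 case: 27/12 at x = 1/2
for n in (1, 2, 3, 5):
    print(n, integrate(lambda x: uniform_mean_density(x, n), 0.0, 1.0))
```

The alternating sum loses accuracy to cancellation for large n, so this direct
evaluation is meant only for small sample sizes.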


3.7 Cauchy Distribution 

Let $X_1, \ldots, X_n$ be a random sample from the Cauchy density

$$f(x) = \frac{1}{\pi\beta\{1 + [(x - \alpha)/\beta]^2\}};$$

then $\bar X_n$ has this same Cauchy distribution for any n. That is, the
sample mean has the same distribution as one of its components. We are unable
to easily verify this result. The moment-generating-function technique fails us
since the moment generating function of a Cauchy distribution does not exist.
Mathematical induction in conjunction with the convolution formula produces
integrations that are apt to be difficult for a nonadvanced calculus student to
perform. The result, however, is easily obtained using complex-variable
analysis. In fact, if we had defined the characteristic function of a random
variable, which is a generalization of a moment generating function, then the
above result would follow immediately from the fact that the product of the
characteristic functions of independent and identically distributed random
variables is the characteristic function of their sum. A major advantage of
characteristic functions over moment generating functions is that they always
exist.
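
Although the result is hard to prove with the tools at hand, it is easy to see
in a simulation. The sketch below is illustrative; the function names, seed,
and sample sizes are assumptions. It draws Cauchy variates with $\alpha = 0$
and $\beta = 1$ by applying the inverse distribution function to uniform
variates, and it measures the spread of the simulated sample means by the
interquartile range, since the variance does not exist. For this density the
quartiles are $\alpha \pm \beta$, so the interquartile range is 2, and it
should stay near 2 however large n is.

```python
import math
import random
from statistics import quantiles


def cauchy_draw(rng, alpha=0.0, beta=1.0):
    """One draw from the Cauchy density with location alpha and scale beta,
    via the inverse distribution function applied to a uniform variate."""
    return alpha + beta * math.tan(math.pi * (rng.random() - 0.5))


def interquartile_range_of_mean(n, reps, rng):
    """Interquartile range of `reps` simulated sample means of size n."""
    means = [sum(cauchy_draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1


rng = random.Random(7)  # fixed seed so the run is repeatable
for n in (1, 10, 100):
    print(n, interquartile_range_of_mean(n, 4000, rng))  # each should be near 2
```

Contrast this with a population having finite variance, where the spread of
the sample mean shrinks like $1/\sqrt{n}$.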


4 SAMPLING FROM THE NORMAL DISTRIBUTIONS

4.1 Role of the Normal Distribution in Statistics

It will be found in the ensuing chapters that the normal distribution plays a very 
predominant role in statistics. Of course, the central-limit theorem alone 
ensures that this will be the case, but there are other almost equally important 
reasons. 
In the first place, many populations encountered in the course of research
in many fields seem to have a normal distribution to a good degree of
approximation. It has often been argued that this phenomenon is quite
reasonable in view of the central-limit theorem. We may consider the firing of
a shot at a target as an illustration. The course of the projectile is affected
by a great many factors, all admittedly with small effect. The net deviation is
the net effect of all these factors. Suppose that the effect of each factor is
an observation from some population; then the total effect is essentially the
mean of a set of observations from a set of populations. Being of the nature of
means, the actual observed deviations might therefore be expected to be
approximately normally distributed. We do not intend to imply here that most
distributions encountered in practice are normal, for such is not the case at
all, but nearly normal distributions are encountered quite frequently.

Another consideration which favors the normal distribution is the fact 
that sampling distributions based on a parent normal distribution are fairly 
manageable analytically. In making inferences about populations from 
samples, it is necessary to have the distributions for various functions of the 
sample observations. The mathematical problem of obtaining these distributions
is often easier for samples from a normal population than from any other,
and the remaining subsections of this section will be devoted to the problem of 
finding the distributions of several different functions of a random sample from 
a normally distributed population. 

In applying statistical methods based on the normal distribution, the 
experimenter must know, at least approximately, the general form of the
distribution function which his data follow. If it is normal, he may use the
methods
directly; if it is not, he may sometimes transform his data so that the transformed 
observations follow a normal distribution. When the experimenter does not 
know the form of his population distribution, then he may use other more 
general but usually less powerful methods of analysis called nonparametric 
methods. Some of these methods will be presented in the final chapter of this 
book. 


4.2 Sample Mean 

One of the simplest of all the possible functions of a random sample is the 
sample mean, and for a random sample from a normal distribution the
distribution (exact) of the sample mean is also normal. This result first appeared
as a special case of Example 12 in Chap. V. It is repeated here. 
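Theorem 6 Let $\bar X_n$ denote the sample mean of a random sample of size n
from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then
$\bar X_n$ has a normal distribution with mean $\mu$ and variance $\sigma^2/n$.


proof To prove this theorem we shall use the moment-generating-function
technique.

$$m_{\bar X_n}(t) = \mathcal{E}[\exp t\bar X_n] = \mathcal{E}\left[\exp\frac{t\sum X_i}{n}\right] = \mathcal{E}\left[\prod_{i=1}^{n}\exp\frac{tX_i}{n}\right] = \prod_{i=1}^{n}\mathcal{E}\left[\exp\frac{tX_i}{n}\right]$$

$$= \prod_{i=1}^{n} m_{X_i}\left(\frac{t}{n}\right) = \prod_{i=1}^{n}\exp\left[\frac{\mu t}{n} + \frac{1}{2}\left(\frac{\sigma t}{n}\right)^2\right]$$

$$= \exp\left[\mu t + \frac{\frac{1}{2}(\sigma t)^2}{n}\right],$$

which is the moment generating function of a normal distribution with
mean $\mu$ and variance $\sigma^2/n$. ////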

Since we have the exact distribution of $\bar X_n$, in considering estimating
$\mu$ with $\bar X_n$, we will be able to calculate, for instance, the (exact)
probability that our “estimator” $\bar X_n$ is within any fixed amount of the
unknown parameter $\mu$.
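
For a normal population this exact probability has a closed form. By Theorem 6,
$(\bar X_n - \mu)/(\sigma/\sqrt{n})$ is standard normal, so
$P[\,|\bar X_n - \mu| < c\,] = 2\Phi(c\sqrt{n}/\sigma) - 1$, which equals
$\operatorname{erf}(c\sqrt{n}/(\sigma\sqrt{2}))$. The sketch below is
illustrative (the function name is an assumption) and revisits the setting of
Example 4 under the added assumption that the population is normal.

```python
import math


def prob_mean_within(c, n, sigma):
    """Exact P[|sample mean - mu| < c] for a normal sample of size n.

    Uses 2 * Phi(z) - 1 = erf(z / sqrt(2)) with z = c * sqrt(n) / sigma.
    """
    return math.erf(c * math.sqrt(n) / (sigma * math.sqrt(2.0)))


# Setting of Example 4 (sigma = 1, c = .5), now assuming a normal population.
print(prob_mean_within(0.5, 80, 1.0))  # far above the Chebyshev guarantee of .95
smallest_n = next(n for n in range(1, 1000)
                  if prob_mean_within(0.5, n, 1.0) >= 0.95)
print(smallest_n)  # 16
```

Knowing that the population is normal reduces the required sample size from
more than 80 (the Chebyshev bound, valid for any density with variance 1) to
16.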


4.3 The Chi-Square Distribution 

The normal distribution has two unknown parameters $\mu$ and $\sigma^2$. In
the previous subsection we found the distribution of $\bar X_n$, which
“estimates” the unknown $\mu$. In this subsection, we seek the distribution of

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X_n)^2,$$

which “estimates” the unknown $\sigma^2$. A density function which plays a
central role in the derivation of the distribution of $S^2$ is the chi-square
distribution.

Definition 8 Chi-square distribution If $X$ is a random variable with
density

$$f_X(x) = \frac{1}{\Gamma(k/2)}\left(\frac{1}{2}\right)^{k/2} x^{k/2 - 1} e^{-\frac{1}{2}x} I_{(0,\infty)}(x), \tag{21}$$

then $X$ is defined to have a chi-square distribution with k degrees of
freedom; or the density given in Eq. (21) is called a chi-square density with k
degrees of freedom, where the parameter k, called the degrees of freedom, is a
positive integer. ////


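Remark We note that a chi-square density is a particular case of a
gamma density with gamma parameters r and $\lambda$ equal, respectively, to
$k/2$ and $\frac{1}{2}$. Hence, if a random variable $X$ has a chi-square
distribution,

$$\mathcal{E}[X] = \frac{k/2}{1/2} = k, \qquad \operatorname{var}[X] = \frac{k/2}{(1/2)^2} = 2k, \tag{22}$$

and

$$m_X(t) = \left[\frac{\frac{1}{2}}{\frac{1}{2} - t}\right]^{k/2} = \left[\frac{1}{1 - 2t}\right]^{k/2}, \quad t < \frac{1}{2}. \tag{23}$$

////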


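Theorem 7 If the random variables $X_i$, $i = 1, 2, \ldots, k$, are normally
and independently distributed with means $\mu_i$ and variances $\sigma_i^2$,
then

$$U = \sum_{i=1}^{k}\left(\frac{X_i - \mu_i}{\sigma_i}\right)^2$$

has a chi-square distribution with k degrees of freedom.

proof Write $Z_i = (X_i - \mu_i)/\sigma_i$; then $Z_i$ has a standard normal
distribution. Now

$$m_U(t) = \mathcal{E}[\exp tU] = \mathcal{E}\left[\exp\left(t\sum Z_i^2\right)\right] = \mathcal{E}\left[\prod_{i=1}^{k}\exp tZ_i^2\right] = \prod_{i=1}^{k}\mathcal{E}[\exp tZ_i^2].$$

But

$$\mathcal{E}[\exp tZ_i^2] = \int_{-\infty}^{\infty} e^{tz^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z^2}\,dz = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(1 - 2t)z^2}\,dz$$

$$= \frac{1}{\sqrt{1 - 2t}}\int_{-\infty}^{\infty}\frac{\sqrt{1 - 2t}}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(1 - 2t)z^2}\,dz = \frac{1}{\sqrt{1 - 2t}} \quad \text{for } t < \frac{1}{2},$$

the latter integral being unity since it represents the area under a normal
curve with variance $1/(1 - 2t)$. Hence,

$$\prod_{i=1}^{k}\mathcal{E}[\exp tZ_i^2] = \prod_{i=1}^{k}\frac{1}{\sqrt{1 - 2t}} = \left(\frac{1}{1 - 2t}\right)^{k/2} \quad \text{for } t < \frac{1}{2},$$

the moment generating function of a chi-square distribution with k
degrees of freedom. ////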


Corollary If $X_1, \ldots, X_n$ is a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$, then
$U = \sum_{i=1}^{n} (X_i - \mu)^2/\sigma^2$ has a chi-square distribution with
n degrees of freedom. ////


We might note that if either $\mu$ or $\sigma^2$ is unknown, the U in the
above corollary is not a statistic. On the other hand, if $\mu$ is known and
$\sigma^2$ is unknown, we could estimate $\sigma^2$ with
$(1/n)\sum_{i=1}^{n} (X_i - \mu)^2$ {note that
$\mathcal{E}[(1/n)\sum_{i=1}^{n} (X_i - \mu)^2] = (1/n)\sum_{i=1}^{n}\mathcal{E}[(X_i - \mu)^2] = (1/n)\sum_{i=1}^{n}\sigma^2 = \sigma^2$},
and find the distribution of $(1/n)\sum_{i=1}^{n} (X_i - \mu)^2$ by using the
corollary.


Remark In words, Theorem 7 says, “the sum of the squares of independent
standard normal random variables has a chi-square distribution
with degrees of freedom equal to the number of terms in the sum.” ////


Theorem 8 If $Z_1, Z_2, \ldots, Z_n$ is a random sample from a standard
normal distribution, then:

(i) $\bar Z$ has a normal distribution with mean 0 and variance $1/n$.

(ii) $\bar Z$ and $\sum_{i=1}^{n} (Z_i - \bar Z)^2$ are independent.

(iii) $\sum_{i=1}^{n} (Z_i - \bar Z)^2$ has a chi-square distribution with
n - 1 degrees of freedom.

proof (Our proof will be incomplete.) (i) is a special case of
Theorem 6. We will prove (ii) for the case n = 2. If n = 2,

$$\bar Z = \frac{Z_1 + Z_2}{2}$$

and

$$\sum (Z_i - \bar Z)^2 = \left(Z_1 - \frac{Z_1 + Z_2}{2}\right)^2 + \left(Z_2 - \frac{Z_1 + Z_2}{2}\right)^2 = \frac{(Z_1 - Z_2)^2}{4} + \frac{(Z_2 - Z_1)^2}{4} = \frac{(Z_2 - Z_1)^2}{2};$$

so $\bar Z$ is a function of $Z_1 + Z_2$, and $\sum (Z_i - \bar Z)^2$ is a
function of $Z_2 - Z_1$; so to prove $\bar Z$ and $\sum (Z_i - \bar Z)^2$ are
independent, it suffices to show that $Z_1 + Z_2$ and $Z_2 - Z_1$ are
independent. Now

$$m_{Z_1 + Z_2}(t_1) = \mathcal{E}[e^{t_1(Z_1 + Z_2)}] = \mathcal{E}[e^{t_1 Z_1} e^{t_1 Z_2}] = \mathcal{E}[e^{t_1 Z_1}]\,\mathcal{E}[e^{t_1 Z_2}] = \exp\tfrac{1}{2}t_1^2\,\exp\tfrac{1}{2}t_1^2 = \exp t_1^2,$$

and, similarly,

$$m_{Z_2 - Z_1}(t_2) = \exp t_2^2.$$

Also,

$$m_{Z_1 + Z_2,\,Z_2 - Z_1}(t_1, t_2) = \mathcal{E}[e^{t_1(Z_1 + Z_2) + t_2(Z_2 - Z_1)}]$$

$$= \mathcal{E}[e^{(t_1 - t_2)Z_1} e^{(t_1 + t_2)Z_2}] = \mathcal{E}[e^{(t_1 - t_2)Z_1}]\,\mathcal{E}[e^{(t_1 + t_2)Z_2}]$$

$$= e^{\frac{1}{2}(t_1 - t_2)^2} e^{\frac{1}{2}(t_1 + t_2)^2} = \exp t_1^2\,\exp t_2^2$$

$$= m_{Z_1 + Z_2}(t_1)\,m_{Z_2 - Z_1}(t_2);$$

and since the joint moment generating function factors into the product
of the marginal moment generating functions, $Z_1 + Z_2$ and $Z_2 - Z_1$ are
independent.

To prove (iii), we accept the independence of $\bar Z$ and
$\sum_{i=1}^{n} (Z_i - \bar Z)^2$ for arbitrary n. Let us note that
$\sum Z_i^2 = \sum (Z_i - \bar Z + \bar Z)^2 = \sum (Z_i - \bar Z)^2 + 2\bar Z\sum (Z_i - \bar Z) + \sum \bar Z^2 = \sum (Z_i - \bar Z)^2 + n\bar Z^2$;
also $\sum (Z_i - \bar Z)^2$ and $n\bar Z^2$ are independent; hence

$$m_{\sum Z_i^2}(t) = m_{\sum (Z_i - \bar Z)^2}(t)\,m_{n\bar Z^2}(t).$$

So,

$$m_{\sum (Z_i - \bar Z)^2}(t) = \frac{m_{\sum Z_i^2}(t)}{m_{n\bar Z^2}(t)} = \frac{(1/(1 - 2t))^{n/2}}{(1/(1 - 2t))^{1/2}} = \left(\frac{1}{1 - 2t}\right)^{(n-1)/2}, \quad t < \frac{1}{2},$$

noting that $\sqrt{n}\,\bar Z$ has a standard normal distribution implying that
$n\bar Z^2$ has a chi-square distribution with one degree of freedom. We have
shown that the moment generating function of $\sum (Z_i - \bar Z)^2$ is that of
a chi-square distribution with n - 1 degrees of freedom, which completes
the proof. ////

Theorem 8 was stated for a random sample from a standard normal distribution,
whereas if we wish to make inferences about $\mu$ and $\sigma^2$, our sample
is from a normal distribution with mean $\mu$ and variance $\sigma^2$. Let
$X_1, \ldots, X_n$ denote the sample from the normal distribution with mean
$\mu$ and variance $\sigma^2$; then the $Z_i$ of Theorem 8 could be taken equal
to $(X_i - \mu)/\sigma$.

(i) of Theorem 8 becomes:

(i') $\bar Z = (1/n)\sum (X_i - \mu)/\sigma = (\bar X - \mu)/\sigma$ has a
normal distribution with mean 0 and variance $1/n$.

(ii) of Theorem 8 becomes:

(ii') $\bar Z = (\bar X - \mu)/\sigma$ and
$\sum (Z_i - \bar Z)^2 = \sum [(X_i - \mu)/\sigma - (\bar X - \mu)/\sigma]^2 = \sum [(X_i - \bar X)^2/\sigma^2]$
are independent, which implies $\bar X$ and $\sum (X_i - \bar X)^2$ are
independent.

(iii) of Theorem 8 becomes:

(iii') $\sum (Z_i - \bar Z)^2 = \sum [(X_i - \bar X)^2/\sigma^2]$ has a
chi-square distribution with n - 1 degrees of freedom.


Corollary If $S^2 = [1/(n - 1)]\sum_{i=1}^{n} (X_i - \bar X)^2$ is the sample
variance of a random sample from a normal distribution with mean $\mu$ and
variance $\sigma^2$, then

$$U = \frac{(n - 1)S^2}{\sigma^2} \tag{24}$$

has a chi-square distribution with n - 1 degrees of freedom.

proof This is just (iii'). ////


Remark Since $S^2$ is a linear function of U in Eq. (24), the density of
$S^2$ can be obtained from the density of U. It is

$$f_{S^2}(y) = \left(\frac{n - 1}{2\sigma^2}\right)^{(n-1)/2}\frac{1}{\Gamma[(n - 1)/2]}\, y^{(n-3)/2}\, e^{-(n-1)y/2\sigma^2}\, I_{(0,\infty)}(y). \tag{25}$$

////
Remark The phrase “degrees of freedom” can refer to the number of
independent squares in the sum. For example, the sum of Theorem 7 has
k independent squares, but the sum in (iii) of Theorem 8 has only n - 1
independent terms since the relation $\sum (Z_i - \bar Z) = 0$ enables one to
compute any one of the deviations $Z_i - \bar Z$, given the other n - 1 of
them. ////

All the results of this section apply only to normal populations. In fact, 
it can be proved that for no other distributions (i) are the sample mean and 
sample variance independently distributed or (ii) is the sample mean exactly 
normally distributed. 
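
The corollary to Theorem 8 can be examined by simulation. The sketch below is
an illustration under assumed values (n = 5, $\mu$ = 10, $\sigma$ = 2, a fixed
seed, and hypothetical function names): it simulates
$U = (n - 1)S^2/\sigma^2$ for normal samples and compares the simulated mean
and variance with the chi-square values n - 1 and 2(n - 1) given by Eq. (22).

```python
import random


def simulate_u(n, mu, sigma, reps, rng):
    """Simulate U = (n - 1) S^2 / sigma^2 of Eq. (24) for normal samples of size n."""
    draws = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # sample variance
        draws.append((n - 1) * s2 / sigma ** 2)
    return draws


def mean_and_variance(values):
    """Mean and (divide-by-N) variance of a list of simulated values."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable
u = simulate_u(n=5, mu=10.0, sigma=2.0, reps=20_000, rng=rng)
print(mean_and_variance(u))  # should be near n - 1 = 4 and 2(n - 1) = 8
```

Agreement of two moments is of course only a plausibility check, not a proof
that U has the chi-square distribution.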


4.4 The F Distribution 


A distribution, the F distribution, which we shall later find to be of
considerable practical interest, is the distribution of the ratio of two
independent chi-square random variables divided by their respective degrees of
freedom. We suppose that U and V are independently distributed with chi-square
distributions with m and n degrees of freedom, respectively. Their joint
density is then [see Eq. (21)]

$$f_{U,V}(u, v) = \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}}\, u^{(m-2)/2} v^{(n-2)/2} e^{-\frac{1}{2}(u+v)} I_{(0,\infty)}(u) I_{(0,\infty)}(v). \tag{26}$$

We shall find the distribution of the quantity

$$X = \frac{U/m}{V/n}, \tag{27}$$

which is sometimes referred to as the variance ratio. To find the distribution
of X, we make the transformation $X = (U/m)/(V/n)$ and $Y = V$, obtain the
joint distribution of X and Y, and then get the marginal distribution of X by
integrating out the y variable. The Jacobian of the transformation is
$(m/n)y$; so

$$f_{X,Y}(x, y) = \frac{m}{n}\, y\, \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}}\left(\frac{m}{n}\,xy\right)^{(m-2)/2} y^{(n-2)/2} e^{-\frac{1}{2}[(m/n)xy + y]} I_{(0,\infty)}(x) I_{(0,\infty)}(y),$$

and

$$f_X(x) = \int_0^\infty f_{X,Y}(x, y)\,dy$$

$$= \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}}\left(\frac{m}{n}\right)^{m/2} x^{(m-2)/2}\int_0^\infty y^{(m+n-2)/2} e^{-\frac{1}{2}[(m/n)x + 1]y}\,dy$$

$$= \frac{\Gamma[(m + n)/2]}{\Gamma(m/2)\Gamma(n/2)}\left(\frac{m}{n}\right)^{m/2}\frac{x^{(m-2)/2}}{[1 + (m/n)x]^{(m+n)/2}}\, I_{(0,\infty)}(x). \tag{28}$$
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Definition 9 F distribution If A' is a random variable having density 
given by Eq. (28), then X is defined to be an F-distributed random variable 
with degrees of freedom m and n. Hll 

The order in which the degrees of freedom are given is important since the 
density of the F distribution is not symmetrical in m and n. The number of 
degrees of freedom of the numerator of the ratio m/n that appears in Eq. (28) 
is always quoted first. Or if the F-distributed random variable is a ratio of two
independent chi-square-distributed random variables divided by their respective 
degrees of freedom, as in the derivation above, then the degrees of freedom of the 
chi-square random variable that appears in the numerator are always quoted 
first. 

We have proved the following theorem. 

Theorem 9 Let U be a chi-square random variable with m degrees of 
freedom; let V be a chi-square random variable with n degrees of freedom, 
and let U and V be independent. Then the random variable

$$X = \frac{U/m}{V/n}$$

is distributed as an F distribution with m and n degrees of freedom. The
density of X is given in Eq. (28). ////

The following corollary shows how the result of Theorem 9 can be useful 
in sampling. 

Corollary If $X_1, \ldots, X_{m+1}$ is a random sample of size m + 1 from a
normal population with mean $\mu_X$ and variance $\sigma^2$, if $Y_1, \ldots, Y_{n+1}$ is a
random sample of size n + 1 from a normal population with mean $\mu_Y$ and
variance $\sigma^2$, and if the two samples are independent, then it follows
that $(1/\sigma^2) \sum_{i=1}^{m+1} (X_i - \bar{X})^2$ is chi-square distributed with m degrees of
freedom, and $(1/\sigma^2) \sum_{j=1}^{n+1} (Y_j - \bar{Y})^2$ is chi-square distributed with n degrees
of freedom; so that the statistic

$$\frac{\sum (X_i - \bar{X})^2 / m}{\sum (Y_j - \bar{Y})^2 / n}$$

has an F distribution with m and n degrees of freedom. ////

We close this subsection with several further remarks about the F distribution.

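Remark If X is an F-distributed random variable with m and n degrees
of freedom, then

$$\mathscr{E}[X] = \frac{n}{n-2} \quad \text{for } n > 2$$

and

$$\operatorname{var}[X] = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)} \quad \text{for } n > 4. \tag{29}$$

proof At first it might be surprising that the mean depends only
on the degrees of freedom of the denominator. Write X as in Eq. (27);
that is,

$$X = \frac{U/m}{V/n};$$

then, since U and V are independent,

$$\mathscr{E}[X] = \mathscr{E}\left[\frac{U/m}{V/n}\right] = \frac{n}{m}\,\mathscr{E}[U]\,\mathscr{E}\left[\frac{1}{V}\right].$$

But $\mathscr{E}[U] = m$ by Eq. (22), and

$$\begin{aligned}
\mathscr{E}\left[\frac{1}{V}\right] &= \frac{1}{\Gamma(n/2)} \left(\frac{1}{2}\right)^{n/2} \int_0^\infty \frac{1}{v}\, v^{(n-2)/2} e^{-\frac{1}{2}v}\, dv \\
&= \frac{1}{\Gamma(n/2)} \left(\frac{1}{2}\right)^{n/2} \int_0^\infty v^{(n-4)/2} e^{-\frac{1}{2}v}\, dv \\
&= \frac{\Gamma[(n-2)/2]}{\Gamma(n/2)} \left(\frac{1}{2}\right)^{n/2} \left(\frac{1}{2}\right)^{-(n-2)/2} = \frac{1}{n-2};
\end{aligned}$$

and so

$$\mathscr{E}[X] = \frac{n}{m}\,\mathscr{E}[U]\,\mathscr{E}\left[\frac{1}{V}\right] = \frac{n}{m} \cdot m \cdot \frac{1}{n-2} = \frac{n}{n-2}.$$

The variance formula is similarly derived. ////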
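The moment formulas in Eq. (29) can be checked numerically against the density in Eq. (28). The following is a minimal sketch, not part of the original text: the function names, the choice m = 4 and n = 10, and the midpoint-rule integrator (with the substitution x = t/(1 - t)) are all illustrative assumptions.

```python
import math

def f_density(x, m, n):
    """Density of the F distribution with m and n degrees of freedom, Eq. (28)."""
    if x <= 0:
        return 0.0
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m / n))
    return math.exp(log_c + ((m - 2) / 2) * math.log(x)
                    - ((m + n) / 2) * math.log1p(m * x / n))

def integrate_half_line(g, steps=20000):
    """Midpoint rule for the integral of g over (0, inf), using x = t / (1 - t)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        x = t / (1.0 - t)
        total += g(x) / (1.0 - t) ** 2   # dx = dt / (1 - t)^2
    return total * h

m, n = 4, 10   # illustrative degrees of freedom (n > 4 so the variance exists)
total_mass = integrate_half_line(lambda x: f_density(x, m, n))
mean = integrate_half_line(lambda x: x * f_density(x, m, n))
second_moment = integrate_half_line(lambda x: x * x * f_density(x, m, n))
variance = second_moment - mean ** 2

mean_formula = n / (n - 2)
variance_formula = 2 * n**2 * (m + n - 2) / (m * (n - 2)**2 * (n - 4))
print(total_mass, mean, mean_formula, variance, variance_formula)
```

Changing m while holding n fixed leaves the computed mean unchanged, which is the point made at the start of the proof.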


Remark If X has an F distribution with m and n degrees of freedom,
then 1/X has an F distribution with n and m degrees of freedom. This
result allows one to table the F distribution for the upper tail only. For
example, if the quantile $\xi_{.95}$ is given for an F distribution with m and n
degrees of freedom, then the quantile $\xi'_{.05}$ for an F distribution with n and
m degrees of freedom is given by $1/\xi_{.95}$. In general, if X has an F
distribution with m and n degrees of freedom and Y has an F distribution
with n and m degrees of freedom, then the pth quantile point of X, $\xi_p$,
is the reciprocal of the (1 − p)th quantile point of Y, $\xi'_{1-p}$, as the following
shows:

$$p = P[X \le \xi_p] = P\left[\frac{1}{X} \ge \frac{1}{\xi_p}\right] = P\left[Y \ge \frac{1}{\xi_p}\right] = 1 - P\left[Y \le \frac{1}{\xi_p}\right];$$

but

$$1 - p = P[Y \le \xi'_{1-p}];$$

so

$$\xi'_{1-p} = \frac{1}{\xi_p}. \quad ////$$
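The reciprocal relation between quantiles can be illustrated numerically. The sketch below is an assumption-laden illustration rather than a table-quality routine: the helper names, the midpoint-rule c.d.f., the bisection bracket (0, 50), and the choice m = 4, n = 10 are all hypothetical.

```python
import math

def f_density(x, m, n):
    """Density of the F distribution with m and n degrees of freedom, Eq. (28)."""
    if x <= 0:
        return 0.0
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m / n))
    return math.exp(log_c + ((m - 2) / 2) * math.log(x)
                    - ((m + n) / 2) * math.log1p(m * x / n))

def f_cdf(x, m, n, steps=2000):
    """P[X <= x], approximated by the midpoint rule on (0, x)."""
    if x <= 0:
        return 0.0
    h = x / steps
    return h * sum(f_density((i + 0.5) * h, m, n) for i in range(steps))

def f_quantile(p, m, n, lo=0.0, hi=50.0):
    """Solve f_cdf(x) = p by bisection; assumes the quantile lies in (lo, hi)."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if f_cdf(mid, m, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m, n = 4, 10
upper = f_quantile(0.95, m, n)   # .95 quantile with (m, n) degrees of freedom
lower = f_quantile(0.05, n, m)   # .05 quantile with (n, m) degrees of freedom
print(upper, lower, upper * lower)   # the product should be close to 1
```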

Remark If X is an F-distributed random variable with m and n degrees
of freedom, then

$$\frac{mX/n}{1 + mX/n}$$

has a beta density with parameters a = m/2 and b = n/2. ////

4.5 Student’s t Distribution 

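Another distribution of considerable practical importance is that of the ratio of
a standard normally distributed random variable to the square root of an
independently distributed chi-square random variable divided by its degrees of
freedom. That is, if Z has a standard normal distribution, if U has a chi-square
distribution with k degrees of freedom, and if Z and U are independent,
we seek the distribution of

$$X = \frac{Z}{\sqrt{U/k}}.$$

The joint density of Z and U is given by

$$f_{Z,U}(z, u) = \frac{1}{\sqrt{2\pi}} \frac{1}{\Gamma(k/2)} \left(\frac{1}{2}\right)^{k/2} u^{(k/2)-1} e^{-\frac{1}{2}u} e^{-\frac{1}{2}z^2} I_{(0,\infty)}(u). \tag{30}$$

If we make the transformation $X = Z/\sqrt{U/k}$ and Y = U, the Jacobian is
$\sqrt{y/k}$, and so

$$f_{X,Y}(x, y) = \sqrt{\frac{y}{k}}\, \frac{1}{\sqrt{2\pi}} \frac{1}{\Gamma(k/2)} \left(\frac{1}{2}\right)^{k/2} y^{(k/2)-1} e^{-\frac{1}{2}y} e^{-\frac{1}{2}x^2 y/k} I_{(0,\infty)}(y)$$

and

$$\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \\
&= \frac{1}{\sqrt{2k\pi}} \frac{1}{\Gamma(k/2)} \left(\frac{1}{2}\right)^{k/2} \int_0^\infty y^{k/2 - 1 + 1/2} e^{-\frac{1}{2}(1 + x^2/k)y}\, dy \\
&= \frac{\Gamma[(k+1)/2]}{\Gamma(k/2)} \frac{1}{\sqrt{k\pi}} \frac{1}{(1 + x^2/k)^{(k+1)/2}}.
\end{aligned} \tag{31}$$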


Definition 10 Student’s t distribution If X is a random variable having
density given by Eq. (31), then X is defined to have a Student’s t distribution,
or the density given in Eq. (31) is called a Student’s t distribution
with k degrees of freedom. ////

We have derived the following result. 

Theorem 10 If Z has a standard normal distribution, if U has a chi-square
distribution with k degrees of freedom, and if Z and U are independent,
then $Z/\sqrt{U/k}$ has a Student’s t distribution with k degrees of
freedom. ////


The following corollary shows how the result of Theorem 10 is applicable 
to sampling from a normal population. 


Corollary If $X_1, \ldots, X_n$ is a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$, then $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard
normal distribution and $U = (1/\sigma^2) \sum (X_i - \bar{X})^2$ has a chi-square distribution
with n − 1 degrees of freedom. Furthermore, Z and U are independent
(see Theorem 8); hence

$$\frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{(1/\sigma^2) \sum (X_i - \bar{X})^2 / (n-1)}} = \frac{\sqrt{n(n-1)}\,(\bar{X} - \mu)}{\sqrt{\sum (X_i - \bar{X})^2}}$$

has a Student’s t distribution with n − 1 degrees of freedom. ////

We might note that for one degree of freedom the Student’s t distribution
reduces to a Cauchy distribution; and as the number of degrees of freedom
increases, the Student’s t distribution approaches the standard normal
distribution. Also, the square of a Student’s t-distributed random variable with k
degrees of freedom has an F distribution with 1 and k degrees of freedom.
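The three facts just noted (Cauchy for one degree of freedom, the normal limit, and the F distribution of the square) can be spot-checked from the densities in Eqs. (31) and (28). This is a minimal sketch; the function names, the evaluation points, and the stand-in k = 10^6 for "large k" are illustrative assumptions.

```python
import math

def t_density(x, k):
    """Student's t density with k degrees of freedom, Eq. (31)."""
    log_c = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
             - 0.5 * math.log(k * math.pi))
    return math.exp(log_c - ((k + 1) / 2) * math.log1p(x * x / k))

def f_density(x, m, n):
    """F density with m and n degrees of freedom, Eq. (28)."""
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m / n))
    return math.exp(log_c + ((m - 2) / 2) * math.log(x)
                    - ((m + n) / 2) * math.log1p(m * x / n))

def cauchy_density(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t_squared_density(w, k):
    """Density of W = X^2 for w > 0: the roots +sqrt(w) and -sqrt(w) each
    contribute t_density(sqrt(w)) / (2 sqrt(w))."""
    return t_density(math.sqrt(w), k) / math.sqrt(w)

points = [0.25, 1.0, 2.5]
cauchy_gap = max(abs(t_density(x, 1) - cauchy_density(x)) for x in points)
normal_gap = max(abs(t_density(x, 10**6) - normal_density(x)) for x in points)
f_gap = max(abs(t_squared_density(w, 7) - f_density(w, 1, 7)) for w in points)
print(cauchy_gap, normal_gap, f_gap)
```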

Remark If X is a random variable having a Student’s t distribution with
k degrees of freedom, then

$$\mathscr{E}[X] = 0 \text{ if } k > 1 \quad \text{and} \quad \operatorname{var}[X] = k/(k-2) \text{ if } k > 2. \tag{32}$$

proof The first two moments of X can be found by writing
$X = Z/\sqrt{U/k}$ as in Theorem 10 and using the independence of Z and U.
The actual derivation is left as an exercise. ////

This completes Sec. 4 on sampling from the normal distribution. Note 
that we considered the distribution of functions of only two different statistics, 
namely, the sample mean and sample variance. In the next chapter we will find 
that these two statistics are the only ones of interest in sampling from a normal 
distribution; they will turn out to be sufficient statistics. 


5 ORDER STATISTICS 

5.1 Definition and Distributions

In Subsec. 2.4 we defined what we meant by statistic and then gave the sample 
moments as examples of easy-to-understand statistics. In this section the 
concept of order statistics will be defined, and some of their properties will be 
investigated. Order statistics, like sample moments, play an important role
in statistical inference. Order statistics are to population quantiles as sample 
moments are to population moments. 

Definition 11 Order statistics Let $X_1, X_2, \ldots, X_n$ denote a random
sample of size n from a cumulative distribution function F(·). Then
$Y_1 \le Y_2 \le \cdots \le Y_n$, where the $Y_i$ are the $X_i$ arranged in order of increasing
magnitudes, are defined to be the order statistics corresponding to
the random sample $X_1, \ldots, X_n$. ////

We note that the $Y_i$ are statistics (they are functions of the random sample
$X_1, X_2, \ldots, X_n$) and are in order. Unlike the random sample itself, the order
statistics are clearly not independent, for if $Y_j \ge y$, then $Y_{j+1} \ge y$.

We seek the distribution, both marginal and joint, of the order statistics.
We have already found the marginal distributions of $Y_1 = \min[X_1, \ldots, X_n]$ and
$Y_n = \max[X_1, \ldots, X_n]$ in Chap. V. Now we will find the marginal cumulative
distribution of an arbitrary order statistic.

Theorem 11 Let $Y_1 \le Y_2 \le \cdots \le Y_n$ represent the order statistics from
a cumulative distribution function F(·). The marginal cumulative distribution
function of $Y_\alpha$, $\alpha = 1, 2, \ldots, n$, is given by

$$F_{Y_\alpha}(y) = \sum_{j=\alpha}^{n} \binom{n}{j} [F(y)]^j [1 - F(y)]^{n-j}. \tag{33}$$

proof For fixed y, let

$$Z_i = I_{(-\infty, y]}(X_i);$$

then

$$\sum_{i=1}^{n} Z_i = \text{the number of } X_i \le y.$$

Note that $\sum_{i=1}^{n} Z_i$ has a binomial distribution with parameters n and F(y).
Now

$$F_{Y_\alpha}(y) = P[Y_\alpha \le y] = P\left[\sum Z_i \ge \alpha\right] = \sum_{j=\alpha}^{n} \binom{n}{j} [F(y)]^j [1 - F(y)]^{n-j}.$$

The key step in the proof is the equivalence of the two events $\{Y_\alpha \le y\}$
and $\{\sum Z_i \ge \alpha\}$. If the $\alpha$th order statistic is less than or equal to y, then
surely the number of $X_i$ less than or equal to y is greater than or equal to
$\alpha$, and conversely. ////
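Eq. (33) is a binomial tail sum, so it can be evaluated directly once the value F(y) is known. The sketch below is illustrative only: the function name and the numbers n = 7 and F(y) = 0.3 are assumptions. For α = n the sum has the single term [F(y)]^n, and for α = 1 it is the complement of the j = 0 term.

```python
import math

def order_stat_cdf(F_y, alpha, n):
    """Eq. (33): P[Y_alpha <= y], given F_y = F(y), as a binomial tail sum."""
    return sum(math.comb(n, j) * F_y**j * (1 - F_y)**(n - j)
               for j in range(alpha, n + 1))

n, F_y = 7, 0.3                           # illustrative values
largest = order_stat_cdf(F_y, n, n)       # equals F(y)^n
smallest = order_stat_cdf(F_y, 1, n)      # equals 1 - (1 - F(y))^n
all_alpha = [order_stat_cdf(F_y, a, n) for a in range(1, n + 1)]
print(largest, smallest, all_alpha)
```

The list `all_alpha` is nonincreasing in α, as it must be because the event {Y_α ≤ y} shrinks as α grows.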


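Corollary

$$F_{Y_n}(y) = \sum_{j=n}^{n} \binom{n}{j} [F(y)]^j [1 - F(y)]^{n-j} = [F(y)]^n,$$

and

$$F_{Y_1}(y) = \sum_{j=1}^{n} \binom{n}{j} [F(y)]^j [1 - F(y)]^{n-j} = 1 - [1 - F(y)]^n. \quad ////$$

Theorem 11 gives the marginal distribution of an individual order statistic
in terms of the cumulative distribution function F(·). For the remainder of
this subsection, we will assume that our random sample $X_1, \ldots, X_n$ came from a
probability density function f(·); that is, we assume that the random variables
$X_i$ are continuous. We seek the density of $Y_\alpha$, which, of course, could be
obtained from Eq. (33) by differentiation of $F_{Y_\alpha}(y)$. Note that

$$\begin{aligned}
f_{Y_\alpha}(y) &= \lim_{\Delta y \to 0} \frac{F_{Y_\alpha}(y + \Delta y) - F_{Y_\alpha}(y)}{\Delta y} = \lim_{\Delta y \to 0} \frac{P[y < Y_\alpha \le y + \Delta y]}{\Delta y} \\
&= \lim_{\Delta y \to 0} \frac{P[(\alpha - 1) \text{ of the } X_i \le y;\ \text{one } X_i \text{ in } (y, y + \Delta y];\ (n - \alpha) \text{ of the } X_i > y + \Delta y]}{\Delta y} \\
&= \lim_{\Delta y \to 0} \left\{ \frac{n!}{(\alpha - 1)!\,1!\,(n - \alpha)!} \frac{[F(y)]^{\alpha-1} [F(y + \Delta y) - F(y)] [1 - F(y + \Delta y)]^{n-\alpha}}{\Delta y} \right\} \\
&= \frac{n!}{(\alpha - 1)!\,(n - \alpha)!} [F(y)]^{\alpha-1} [1 - F(y)]^{n-\alpha} f(y).
\end{aligned}$$

We have made sensible use of the multinomial distribution. Similarly, we can
derive the joint density of $Y_\alpha$ and $Y_\beta$ for $1 \le \alpha < \beta \le n$.

$$\begin{aligned}
f_{Y_\alpha, Y_\beta}(x, y)\, \Delta x\, \Delta y &\approx P[x < Y_\alpha \le x + \Delta x;\ y < Y_\beta \le y + \Delta y] \\
&\approx P[(\alpha - 1) \text{ of the } X_i \le x;\ \text{one } X_i \text{ in } (x, x + \Delta x]; \\
&\qquad (\beta - \alpha - 1) \text{ of the } X_i \text{ in } (x + \Delta x, y]; \\
&\qquad \text{one } X_i \text{ in } (y, y + \Delta y];\ (n - \beta) \text{ of the } X_i > y + \Delta y] \\
&\approx \frac{n!}{(\alpha - 1)!\,1!\,(\beta - \alpha - 1)!\,1!\,(n - \beta)!} \\
&\qquad \times [F(x)]^{\alpha-1} [F(y) - F(x + \Delta x)]^{\beta-\alpha-1} [1 - F(y + \Delta y)]^{n-\beta} f(x)\, \Delta x\, f(y)\, \Delta y;
\end{aligned}$$

hence

$$f_{Y_\alpha, Y_\beta}(x, y) = \frac{n!}{(\alpha - 1)!\,(\beta - \alpha - 1)!\,(n - \beta)!} [F(x)]^{\alpha-1} [F(y) - F(x)]^{\beta-\alpha-1} [1 - F(y)]^{n-\beta} f(x) f(y)$$

for x < y, and

$$f_{Y_\alpha, Y_\beta}(x, y) = 0 \quad \text{for } x \ge y.$$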





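In general,

$$\begin{aligned}
f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) &= \lim_{\Delta y_i \to 0} \frac{1}{\prod_{i=1}^{n} \Delta y_i} P[y_1 < Y_1 \le y_1 + \Delta y_1; \ldots; y_n < Y_n \le y_n + \Delta y_n] \\
&= \lim_{\Delta y_i \to 0} \frac{1}{\prod_{i=1}^{n} \Delta y_i} P[\text{one } X_i \text{ in } (y_1, y_1 + \Delta y_1]; \ldots; \text{one } X_i \text{ in } (y_n, y_n + \Delta y_n]] \\
&= \lim_{\Delta y_i \to 0} \frac{n!}{\prod_{i=1}^{n} \Delta y_i} [F(y_1 + \Delta y_1) - F(y_1)] \cdots [F(y_n + \Delta y_n) - F(y_n)] \\
&= n!\, f(y_1) \cdots f(y_n) \quad \text{for } y_1 < y_2 < \cdots < y_n,
\end{aligned}$$

and $f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = 0$, otherwise.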

We have derived the following theorem. 

Theorem 12 Let $X_1, X_2, \ldots, X_n$ be a random sample from the probability
density function f(·) with cumulative distribution function F(·).
Let $Y_1 \le Y_2 \le \cdots \le Y_n$ denote the corresponding order statistics; then

$$f_{Y_\alpha}(y) = \frac{n!}{(\alpha - 1)!\,(n - \alpha)!} [F(y)]^{\alpha-1} [1 - F(y)]^{n-\alpha} f(y); \tag{34}$$

$$f_{Y_\alpha, Y_\beta}(x, y) = \frac{n!}{(\alpha - 1)!\,(\beta - \alpha - 1)!\,(n - \beta)!} [F(x)]^{\alpha-1} [F(y) - F(x)]^{\beta-\alpha-1} [1 - F(y)]^{n-\beta} f(x) f(y) I_{(x,\infty)}(y); \tag{35}$$

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \begin{cases} n!\, f(y_1) \cdots f(y_n) & \text{for } y_1 < y_2 < \cdots < y_n \\ 0 & \text{otherwise.} \end{cases} \tag{36}$$

////

Any set of marginal densities can be obtained from the joint density
$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n)$ by simply integrating out the unwanted variables.
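As a concrete instance of Eq. (34), take a uniform population on (0, 1), so that F(y) = y and f(y) = 1 there. The sketch below is illustrative: the function names and the values n = 5, α = 2 are assumptions, and the comparison value α/(n + 1) is the mean of the beta density with parameters α and n − α + 1 that Eq. (34) reduces to in this case.

```python
import math

def order_stat_density(y, alpha, n, F, f):
    """Eq. (34): density of the alpha-th order statistic of a sample of size n."""
    c = math.factorial(n) / (math.factorial(alpha - 1) * math.factorial(n - alpha))
    return c * F(y)**(alpha - 1) * (1 - F(y))**(n - alpha) * f(y)

def uniform_cdf(y):
    return y          # valid for 0 < y < 1

def uniform_pdf(y):
    return 1.0        # valid for 0 < y < 1

def integrate_unit(g, steps=10000):
    """Midpoint rule on (0, 1)."""
    h = 1.0 / steps
    return h * sum(g((i + 0.5) * h) for i in range(steps))

n, alpha = 5, 2
mass = integrate_unit(
    lambda y: order_stat_density(y, alpha, n, uniform_cdf, uniform_pdf))
mean = integrate_unit(
    lambda y: y * order_stat_density(y, alpha, n, uniform_cdf, uniform_pdf))
print(mass, mean, alpha / (n + 1))
```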


5.2 Distribution of Functions of Order Statistics 

In the previous subsection we derived the joint and marginal distributions of the 
order statistics themselves. In this subsection we will find the distribution of 
certain functions of the order statistics. One possible function of the order
statistics is their arithmetic mean, equal to $(1/n) \sum_{j=1}^{n} Y_j$.
Note, however, that $(1/n) \sum_{j=1}^{n} Y_j = (1/n) \sum_{i=1}^{n} X_i$, the sample mean, which was the
subject of Sec. 3 of this chapter. We define now some other functions of the
order statistics.

Definition 12 Sample median, sample range, sample midrange Let
$Y_1 \le \cdots \le Y_n$ denote the order statistics of a random sample $X_1, \ldots, X_n$
from a density f(·). The sample median is defined to be the middle order
statistic if n is odd and the average of the middle two order statistics if n
is even. The sample range is defined to be $Y_n - Y_1$, and the sample midrange
is defined to be $(Y_1 + Y_n)/2$. ////

If the sample size is odd, then the distribution of the sample median is
given by Eq. (34); for example, if n = 2k + 1, where k is some positive integer,
then $Y_{k+1}$ is the sample median whose distribution is given by Eq. (34). If the
sample size is even, say n = 2k, then the sample median is $(Y_k + Y_{k+1})/2$, the
distribution of which can be obtained by a transformation starting with the
joint density of $Y_k$ and $Y_{k+1}$, which is given by Eq. (35).

We derive now the joint distribution of the sample range and midrange,
from which the marginals can be obtained.

By Eq. (35), we have

$$f_{Y_1, Y_n}(x, y) = n(n-1)[F(y) - F(x)]^{n-2} f(x) f(y) \quad \text{for } x < y. \tag{37}$$

Make the transformation $R = Y_n - Y_1$ and $T = (Y_1 + Y_n)/2$ or r = y − x and
t = (x + y)/2. Now x = t − r/2, and y = t + r/2; hence

$$J = \begin{vmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial t} \\[1ex] \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial t} \end{vmatrix} = \begin{vmatrix} -\tfrac{1}{2} & 1 \\ \tfrac{1}{2} & 1 \end{vmatrix} = -1,$$

and we obtain Theorem 13.

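Theorem 13 If R is the sample range and T the sample midrange from
a probability density function, then their joint distribution is given by

$$f_{R,T}(r, t) = n(n-1)[F(t + r/2) - F(t - r/2)]^{n-2} f(t - r/2) f(t + r/2) \quad \text{for } r > 0, \tag{38}$$

and the marginal distributions are given by

$$f_R(r) = \int_{-\infty}^{\infty} f_{R,T}(r, t)\, dt \quad \text{and} \quad f_T(t) = \int_0^{\infty} f_{R,T}(r, t)\, dr. \tag{39}$$

////

EXAMPLE 6 Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution
on $(\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma)$. Here $\mu$ is the mean, and $\sigma^2$ is the variance
of the sampled population.

$$f(x) = \frac{1}{2\sqrt{3}\sigma} I_{(\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma)}(x)$$

and

$$F(x) = \frac{x - (\mu - \sqrt{3}\sigma)}{2\sqrt{3}\sigma} I_{(\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma)}(x) + I_{[\mu + \sqrt{3}\sigma,\ \infty)}(x).$$

$$f_{R,T}(r, t) = \frac{n(n-1)}{(2\sqrt{3}\sigma)^n}\, r^{n-2}\, I_{(\mu - \sqrt{3}\sigma + r/2,\ \mu + \sqrt{3}\sigma - r/2)}(t)\, I_{(0,\ 2\sqrt{3}\sigma)}(r). \tag{40}$$

$$f_R(r) = \int f_{R,T}(r, t)\, dt = \frac{n(n-1)}{(2\sqrt{3}\sigma)^n}\, r^{n-2} (2\sqrt{3}\sigma - r)\, I_{(0,\ 2\sqrt{3}\sigma)}(r). \tag{41}$$

We note that $f_R(r)$ is independent of the parameter $\mu$.

$$f_T(t) = \int f_{R,T}(r, t)\, dr = \frac{n(n-1)}{(2\sqrt{3}\sigma)^n} \int_0^{\min[2t - 2(\mu - \sqrt{3}\sigma),\ 2(\mu + \sqrt{3}\sigma) - 2t]} r^{n-2}\, dr \cdot I_{(\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma)}(t),$$

which simplifies to

$$f_T(t) = \frac{n}{2\sqrt{3}\sigma} \left[ \left(1 + \frac{t - \mu}{\sqrt{3}\sigma}\right)^{n-1} I_{(-1,0)}\!\left(\frac{t - \mu}{\sqrt{3}\sigma}\right) + \left(1 - \frac{t - \mu}{\sqrt{3}\sigma}\right)^{n-1} I_{[0,1)}\!\left(\frac{t - \mu}{\sqrt{3}\sigma}\right) \right]. \tag{42}$$

From Eq. (41), we can derive $\mathscr{E}[R] = 2\sqrt{3}\sigma(n-1)/(n+1)$.

////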
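The expected range stated at the end of Example 6 follows from Eq. (41) by direct integration, and the same integral can be approximated numerically. This is a minimal sketch; the function names and the values n = 6, σ = 1.5 are illustrative assumptions.

```python
import math

def range_density(r, n, sigma):
    """Eq. (41): density of the sample range for the uniform population of Example 6."""
    width = 2 * math.sqrt(3) * sigma
    if not 0 < r < width:
        return 0.0
    return n * (n - 1) / width**n * r**(n - 2) * (width - r)

def integrate(g, a, b, steps=20000):
    """Midpoint rule on (a, b)."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

n, sigma = 6, 1.5
width = 2 * math.sqrt(3) * sigma
mass = integrate(lambda r: range_density(r, n, sigma), 0.0, width)
mean_range = integrate(lambda r: r * range_density(r, n, sigma), 0.0, width)
expected = width * (n - 1) / (n + 1)     # 2*sqrt(3)*sigma*(n - 1)/(n + 1)
print(mass, mean_range, expected)
```

Neither function takes μ as an argument, which reflects the remark that the density of R does not depend on μ.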


Certain functions of the order statistics are again statistics and may be used
to make statistical inferences. For example, both the sample median and the
midrange can be used to estimate $\mu$, the mean of the population. For the uniform
density given in the above example, the variances of the sample mean, the
sample median, and the sample midrange are compared in Problem 33.


5.3 Asymptotic Distributions 

In Subsec. 3.3, we discussed the asymptotic distribution of the sample mean
$\bar{X}_n$. We saw that $\bar{X}_n$ was asymptotically normally distributed with mean $\mu$ and
variance $\sigma^2/n$. We now consider the question: Is there an asymptotic
distribution for the sample median? We will state (without proof) a more general
result.

Since for asymptotic results the sample size n increases, we let $Y_1^{(n)} \le
Y_2^{(n)} \le \cdots \le Y_n^{(n)}$ denote the order statistics for a sample of size n. The
superscript denotes the sample size. We will give the asymptotic distribution of that
order statistic which is approximately the (np)th order statistic for a sample of
size n for any 0 < p < 1. We say “approximately” the (np)th order statistic
because np may not be an integer. Define $p_n$ to be such that $np_n$ is an integer
and $p_n$ is approximately equal to p; then $Y_{np_n}^{(n)}$ is the $(np_n)$th order statistic for a
sample of size n. (If $X_1, \ldots, X_n$ are independent for each positive integer n, we
will say $X_1, \ldots, X_n, \ldots$ are independent.)

Theorem 14 Let $X_1, \ldots, X_n, \ldots$ be independent identically distributed
random variables with common probability density f(·) and cumulative
distribution function F(·). Assume that F(x) is strictly monotone for
0 < F(x) < 1. Let $\xi_p$ be the unique solution in x of F(x) = p for some
0 < p < 1. ($\xi_p$ is the pth quantile.) Let $p_n$ be such that $np_n$ is an integer
and $n|p_n - p|$ is bounded. Finally, let $Y_{np_n}^{(n)}$ denote the $(np_n)$th order
statistic for a random sample of size n. Then $Y_{np_n}^{(n)}$ is asymptotically
distributed as a normal distribution with mean $\xi_p$ and variance
$p(1 - p)/n[f(\xi_p)]^2$. ////

EXAMPLE 7 Let p = 1/2; then $\xi_p$ is the population median, and Theorem 14
states that the sample median is asymptotically distributed as a normal
distribution with mean the population median and variance $1/4n[f(\xi_{1/2})]^2$.
In particular, if f(·) is a normal density with mean $\mu$ and variance $\sigma^2$,
then the sample median is asymptotically normally distributed with
mean $\mu$ and variance $1/4n[f(\mu)]^2 = \pi\sigma^2/2n$. Recall that the sample mean
is normally distributed with mean $\mu$ and variance $\sigma^2/n$. ////
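The asymptotic variance in Example 7 can be compared with a simulated variance of the sample median. The sketch below is an illustration under stated assumptions: the function name, the seed, and the values n = 101, σ = 2, and 4000 replications are arbitrary choices, and agreement is only approximate because the result is asymptotic and the simulation is random.

```python
import math
import random
import statistics

def simulated_median_variance(n, sigma, reps, seed=0):
    """Variance of the sample median over `reps` normal samples of size n."""
    rng = random.Random(seed)
    medians = [statistics.median(rng.gauss(0.0, sigma) for _ in range(n))
               for _ in range(reps)]
    return statistics.pvariance(medians)

n, sigma = 101, 2.0
simulated = simulated_median_variance(n, sigma, reps=4000)
asymptotic = math.pi * sigma**2 / (2 * n)    # pi * sigma^2 / (2n), from Theorem 14
mean_variance = sigma**2 / n                 # variance of the sample mean
print(simulated, asymptotic, mean_variance)
```

The ratio of `asymptotic` to `mean_variance` is π/2, so for a normal population the median is the less precise of the two estimators of μ.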

In Theorem 14 above we considered a certain kind of asymptotic distribution
of order statistics. We will now consider yet another kind. In the above
we looked at the asymptotic distribution of that order statistic which was
approximately the (np)th order statistic for a sample of size n. Such an order
statistic had (approximately) 100p percent of the n observations to its left.
That is, its relative position remained unchanged as n, the sample size,
increased; it always had (approximately) the same percentage of the n observations
to its left. We will now consider the asymptotic distribution of that order
statistic whose absolute position remains unchanged. That is, we consider
the asymptotic distribution of, say, $Y_k^{(n)}$ for fixed k and increasing n.
$Y_k^{(n)}$ is the kth smallest order statistic for a sample size n > k, and k remains
fixed. In order to make the presentation somewhat simpler, we will take k = 1,
in which case $Y_1^{(n)}$ is the smallest of the n observations. We note that we could
just as well consider the kth largest order statistic, namely $Y_{n-k+1}^{(n)}$, which for
k = 1 specializes to $Y_n^{(n)}$, the largest order statistic for a sample of size n. Either
$Y_1^{(n)}$ or $Y_n^{(n)}$ is often referred to as an extreme-value statistic.

Practical applications of extreme-value statistics are many. The old 
adage that a chain is no stronger than its weakest link provides a simple example. 
If $X_i$ denotes the “strength” of the ith link of a chain with n similar links, then
$Y_1^{(n)} = \min[X_1, \ldots, X_n]$ is the “strength” of the chain. Also, in measuring the
results of certain physical phenomena such as floods, droughts, earthquakes, 
winds, temperatures, etc., it can be seen that under certain circumstances one is 
more interested in extreme values than in average values. For instance, it is 
the extreme earthquake or flood, and not the average earthquake or flood, that 
is more damaging. We can see that results, whether exact or asymptotic, for 
extreme-value statistics can be just as important as results for averages. 

For the most part we will concentrate on finding the asymptotic distribution
of $Y_n^{(n)}$. One might wonder why we should be interested in an asymptotic
distribution of $Y_n^{(n)}$ when the exact distribution, which is given by $F_{Y_n^{(n)}}(y) =
[F(y)]^n$, where F(·) is the c.d.f. sampled from, is known. The hope is that we
will find an asymptotic distribution which does not depend on the sampled
c.d.f. F(·). We recall that the central-limit theorem gave an asymptotic
distribution for $\bar{X}_n$ which did not depend on the sampled distribution even though
the exact distribution of $\bar{X}_n$ could be found.

In searching for the asymptotic distribution of $Y_n^{(n)}$, let us pattern our
development after what was done in deriving the asymptotic distribution of $\bar{X}_n$.
According to the law of large numbers, $\bar{X}_n$ has a degenerate limiting distribution;
that is, the limiting c.d.f. of $\bar{X}_n$ is the cumulative distribution that assigns all its
mass to the point $\mu$. Such a limiting distribution is not useful if one intends to
use the limiting distribution to approximate probabilities of events since it
assigns each event a probability of either 0 or 1. To circumvent such difficulty,
we first “centered” the values of $\bar{X}_n$ by subtracting $\mu$, and then we “inflated”
the values of $\bar{X}_n - \mu$ by multiplying them by $\sqrt{n}/\sigma$, and, consequently, we were
able to get a nondegenerate limiting distribution; that is, according to the
central-limit theorem, $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ had a standard normal distribution as its limiting
distribution. A general procedure, when one is looking for a limiting distribution
of, say, $Z_n$, is to first “center” the $Z_n$ by subtracting a constant, say $a_n$, and
to then “scale” $Z_n - a_n$ by dividing by another constant, say $b_n$. In the case of
the central-limit theorem, $Z_n = \bar{X}_n$, $a_n = \mu$, and $b_n = \sigma/\sqrt{n}$. In the case of
Theorem 14 above, $Z_n = Y_{np_n}^{(n)}$, $a_n = \xi_p$, and $b_n = \sqrt{p(1 - p)/n[f(\xi_p)]^2}$. For both
of these two cases the sequence of constants $\{a_n\}$ did not depend on n. In the
case at hand, namely when $Z_n = Y_n^{(n)}$, the sequence of constants $\{a_n\}$ is likely to
depend on n since $Y_n^{(n)}$ tends to increase with n. Let us look at a couple of
examples.


EXAMPLE 8 Consider sampling from the logistic distribution; that is,
$F(x) = (1 + e^{-x})^{-1}$. Find the limiting distribution of $(Y_n^{(n)} - a_n)/b_n$.
There are two problems: First, what should we take the sequences of
constants $\{a_n\}$ and $\{b_n\}$ to be? And, second, what is the limiting distribution
of $(Y_n^{(n)} - a_n)/b_n$ for the selected constants $\{a_n\}$ and $\{b_n\}$? It seems
reasonable that the “centering” constants $\{a_n\}$ should be close to $\mathscr{E}[Y_n^{(n)}]$;
so we seek an approximation to $\mathscr{E}[Y_n^{(n)}]$. Now $F(X_1), \ldots, F(X_n)$ is a
random sample from the uniform distribution over (0, 1); hence $F(Y_n^{(n)})$
is the largest of a sample of size n from a uniform distribution over (0, 1).
That $\mathscr{E}[F(Y_n^{(n)})] = n/(n + 1)$ can then be routinely derived. Now $F(\mathscr{E}[Y_n^{(n)}]) \approx
\mathscr{E}[F(Y_n^{(n)})]$ or

$$\begin{aligned}
F(\mathscr{E}[Y_n^{(n)}]) &= \{1 + \exp(-\mathscr{E}[Y_n^{(n)}])\}^{-1} \\
&= 1 - \{1 + \exp(\mathscr{E}[Y_n^{(n)}])\}^{-1} \\
&\approx 1 - (n + 1)^{-1} \\
&= \frac{n}{n + 1} \\
&= \mathscr{E}[F(Y_n^{(n)})],
\end{aligned}$$

which implies that

$$n \approx \exp(\mathscr{E}[Y_n^{(n)}])$$

or that

$$\mathscr{E}[Y_n^{(n)}] \approx \log_e n.$$

Finally, since $\mathscr{E}[Y_n^{(n)}] \approx \log n$ (from here on we use log n for $\log_e n$), a
reasonable choice for the sequence of “centering” constants $\{a_n\}$ seems
to be the sequence {log n}. We are seeking

$$\begin{aligned}
\lim_{n \to \infty} P\left[\frac{Y_n^{(n)} - \log n}{b_n} \le y\right] &= \lim_{n \to \infty} P[Y_n^{(n)} \le b_n y + \log n] \\
&= \lim_{n \to \infty} [F(b_n y + \log n)]^n \\
&= \lim_{n \to \infty} (1 + e^{-b_n y - \log n})^{-n} \\
&= \lim_{n \to \infty} (1 + (1/n) e^{-b_n y})^{-n} \\
&= \exp(-e^{-y}) \quad \text{for } b_n = 1.
\end{aligned}$$

Hence, if $\{a_n\}$ and $\{b_n\}$ are selected so that $\{a_n\}$ = {log n} and $\{b_n\}$ = {1},
respectively, then the limiting distribution of $(Y_n^{(n)} - a_n)/b_n = Y_n^{(n)} - \log n$
is $\exp(-e^{-y})$. ////


EXAMPLE 9 Consider sampling from the exponential distribution so that
$F(x) = (1 - e^{-\lambda x}) I_{(0,\infty)}(x)$. Again, let us find the limiting distribution of
$(Y_n^{(n)} - a_n)/b_n$. As in Example 8, $\mathscr{E}[F(Y_n^{(n)})] = n/(n + 1) = 1 - 1/(n + 1)$.
Now

$$F(\mathscr{E}[Y_n^{(n)}]) = 1 - \exp\{-\lambda \mathscr{E}[Y_n^{(n)}]\},$$

and

$$F(\mathscr{E}[Y_n^{(n)}]) \approx \mathscr{E}[F(Y_n^{(n)})];$$

so

$$\exp\{-\lambda \mathscr{E}[Y_n^{(n)}]\} \approx \frac{1}{n + 1},$$

or

$$\mathscr{E}[Y_n^{(n)}] \approx \frac{1}{\lambda} \log(n + 1) \approx \frac{1}{\lambda} \log n.$$

Hence, it seems reasonable to use $a_n = (1/\lambda) \log n$.

$$\begin{aligned}
\lim_{n \to \infty} P\left[\frac{Y_n^{(n)} - (1/\lambda) \log n}{b_n} \le y\right] &= \lim_{n \to \infty} P\left[Y_n^{(n)} \le b_n y + \frac{1}{\lambda} \log n\right] \\
&= \lim_{n \to \infty} \left[F\left(b_n y + \frac{1}{\lambda} \log n\right)\right]^n \\
&= \lim_{n \to \infty} (1 - e^{-\lambda b_n y - \log n})^n \\
&= \lim_{n \to \infty} \left(1 - \frac{1}{n} e^{-\lambda b_n y}\right)^n \\
&= \exp(-e^{-y}) \quad \text{for } b_n = \frac{1}{\lambda}.
\end{aligned}$$

Hence the limiting distribution of $(Y_n^{(n)} - a_n)/b_n = [Y_n^{(n)} - (1/\lambda) \log n]/(1/\lambda)$
is $\exp(-e^{-y})$. We note that we obtained the same limiting distribution
here as in Example 8. Here we were sampling from an exponential
distribution, and there we were sampling from a logistic distribution.

////

In each of the above two examples we were able to obtain the limiting
distribution of $(Y_n^{(n)} - a_n)/b_n$ by using the exact distribution of $Y_n^{(n)}$ and ordinary
algebraic manipulation. There are some rather powerful theoretical results
concerning extreme-value statistics that tell us, among other things, what limiting
distributions we can expect. We can only sketch such results here. The
interested reader is referred to Refs. 13, 30, and 35.
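The limits obtained in Examples 8 and 9 can be checked by evaluating the exact distribution $[F(b_n y + a_n)]^n$ for a large n. The following is a minimal sketch; the function names, the value λ = 2.5, and the stand-in n = 10^6 for "large n" are illustrative assumptions.

```python
import math

def gumbel_cdf(y):
    """The limiting c.d.f. exp(-e^{-y}) found in Examples 8 and 9."""
    return math.exp(-math.exp(-y))

def logistic_max_cdf(y, n):
    """P[Y_n - log n <= y] = [F(y + log n)]^n for F(x) = 1 / (1 + e^{-x})."""
    return (1.0 + math.exp(-y) / n) ** (-n)

def exponential_max_cdf(y, n, lam):
    """P[(Y_n - (1/lam) log n) / (1/lam) <= y] for F(x) = 1 - e^{-lam x}, x > 0."""
    x = (y + math.log(n)) / lam      # b_n * y + a_n with b_n = 1/lam
    if x <= 0:
        return 0.0
    return (1.0 - math.exp(-lam * x)) ** n

n, lam = 10**6, 2.5
points = [-1.0, 0.0, 1.0, 2.0]
logistic_gap = max(abs(logistic_max_cdf(y, n) - gumbel_cdf(y)) for y in points)
exponential_gap = max(abs(exponential_max_cdf(y, n, lam) - gumbel_cdf(y))
                      for y in points)
print(logistic_gap, exponential_gap)
```

Both gaps shrink as n grows, and the exponential case gives the same numbers for any λ > 0 because the scaling $b_n = 1/\lambda$ removes λ from the expression.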

Theorem 15 Let $X_1, \ldots, X_n, \ldots$ be independent and identically distributed
random variables with c.d.f. F(·). If $(Y_n^{(n)} - a_n)/b_n$ has a limiting
distribution, then that limiting distribution must be one of the following
three types:

$$G_1(y; \gamma) = e^{-y^{-\gamma}} I_{(0,\infty)}(y), \quad \text{where } \gamma > 0.$$

$$G_2(y; \gamma) = e^{-|y|^{\gamma}} I_{(-\infty,0)}(y) + I_{[0,\infty)}(y), \quad \text{where } \gamma > 0.$$

$$G_3(y) = \exp(-e^{-y}). \quad ////$$

Theorem 15 states what types of limiting distributions can be expected.
The following theorem gives conditions on the sampled F(·) that enable us to
determine which of the three types of limiting distributions correspond to the
sampled F(·).

Theorem 16 Let $X_1, \ldots, X_n, \ldots$ be independent and identically distributed
random variables with c.d.f. F(·). Assume that $(Y_n^{(n)} - a_n)/b_n$
has a limiting distribution. The limiting distribution is:

(i) $G_1(\cdot\,; \gamma)$ if and only if

$$\lim_{x \to \infty} \frac{1 - F(x)}{1 - F(\tau x)} = \tau^{\gamma} \quad \text{for every } \tau > 0.$$

(ii) $G_2(\cdot\,; \gamma)$ if and only if there exists an $x_0$ such that
$F(x_0) = 1$ and $F(x_0 - \varepsilon) < 1$ for every $\varepsilon > 0$,
and

$$\lim_{0 < x \to 0} \frac{1 - F(x_0 - \tau x)}{1 - F(x_0 - x)} = \tau^{\gamma} \quad \text{for every } \tau > 0.$$

(iii) $G_3(\cdot)$ if and only if

$$\lim_{n \to \infty} n[1 - F(\beta_n x + \alpha_n)] = e^{-x} \quad \text{for each } x,$$

where

$$\alpha_n = \inf\left\{z: \frac{n - 1}{n} \le F(z)\right\}$$

and

$$\beta_n = \inf\{z: 1 - (ne)^{-1} \le F(\alpha_n + z)\}. \quad ////$$

Note that if F(·) is strictly monotone and continuous, then $\alpha_n$ is given by
$F(\alpha_n) = (n - 1)/n$, or $\alpha_n = F^{-1}(\{n - 1\}/n)$; and $\beta_n$ is given by $F(\alpha_n + \beta_n) = 1 -
(ne)^{-1}$, or $\beta_n = F^{-1}(1 - \{ne\}^{-1}) - \alpha_n = F^{-1}(1 - \{ne\}^{-1}) - F^{-1}(\{n - 1\}/n)$.


EXAMPLE 10 Take $F(x) = (1 - e^{-\lambda x}) I_{(0,\infty)}(x)$ as in Example 9. $\alpha_n$ is such
that $F(\alpha_n) = (n - 1)/n$ or $1 - e^{-\lambda \alpha_n} = 1 - 1/n$, which implies that $\alpha_n =
(1/\lambda) \log n$. $\beta_n$ is such that $F(\alpha_n + \beta_n) = 1 - (ne)^{-1}$ or $1 - e^{-\lambda[(1/\lambda) \log n + \beta_n]}
= 1 - (ne)^{-1}$, or $\beta_n = 1/\lambda$.

$$\lim_{n \to \infty} n[1 - F(\beta_n x + \alpha_n)] = \lim_{n \to \infty} n\left(e^{-\lambda(\beta_n x + \alpha_n)}\right) = e^{-x} \quad \text{for each } x,$$

so, as we saw in Example 9, the exponential distribution has $G_3(\cdot)$ as its
corresponding limiting extreme-value distribution. ////
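When F(·) is strictly monotone and continuous, the constants $\alpha_n$ and $\beta_n$ are differences of quantiles and can be computed from the inverse c.d.f. The sketch below is illustrative: the function names and the values λ = 2.5 and n = 1000 are assumptions, and the logistic case is included only for comparison with the constants chosen heuristically in Example 8.

```python
import math

def norming_constants(quantile, n):
    """alpha_n = F^{-1}(1 - 1/n) and beta_n = F^{-1}(1 - 1/(n e)) - alpha_n,
    where `quantile` is the inverse of a strictly monotone continuous c.d.f."""
    alpha_n = quantile(1.0 - 1.0 / n)
    beta_n = quantile(1.0 - 1.0 / (n * math.e)) - alpha_n
    return alpha_n, beta_n

lam = 2.5

def exponential_quantile(p):
    """Inverse of F(x) = 1 - e^{-lam x}."""
    return -math.log(1.0 - p) / lam

def logistic_quantile(p):
    """Inverse of F(x) = 1 / (1 + e^{-x})."""
    return math.log(p / (1.0 - p))

n = 1000
exp_alpha, exp_beta = norming_constants(exponential_quantile, n)
log_alpha, log_beta = norming_constants(logistic_quantile, n)
print(exp_alpha, math.log(n) / lam)   # alpha_n = (1/lam) log n
print(exp_beta, 1.0 / lam)            # beta_n = 1/lam
print(log_alpha, log_beta)            # close to log n and 1, as in Example 8
```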


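EXAMPLE 11 Take $F(x) = F(x; \gamma) = [1 - (1 - x)^{\gamma}] I_{(0,1)}(x) + I_{[1,\infty)}(x)$. Note
that for $x_0 = 1$, $F(x_0) = 1$ and $F(x_0 - \varepsilon) < 1$ for every $\varepsilon > 0$. Also,

$$\lim_{0 < x \to 0} \frac{1 - F(x_0 - \tau x)}{1 - F(x_0 - x)} = \lim_{0 < x \to 0} \frac{(1 - x_0 + \tau x)^{\gamma}}{(1 - x_0 + x)^{\gamma}} = \tau^{\gamma};$$

so $F(\cdot\,; \gamma)$ has $G_2(\cdot\,; \gamma)$ as its limiting extreme-value distribution. ////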


EXAMPLE 12 Take $F(x) = F(x; \gamma)$ the c.d.f. of a t distribution with $\gamma$ degrees
of freedom.

$$\lim_{x \to \infty} \frac{1 - F(x)}{1 - F(\tau x)} = \lim_{x \to \infty} \frac{f(x)}{\tau f(\tau x)} = \lim_{x \to \infty} \frac{[1 + (\tau x)^2/\gamma]^{(\gamma+1)/2}}{\tau (1 + x^2/\gamma)^{(\gamma+1)/2}} = \tau^{\gamma};$$

so the t distribution with $\gamma$ degrees of freedom has $G_1(\cdot\,; \gamma)$ as its limiting
extreme-value distribution. ////

Theorem 16 gives conditions on the sampled c.d.f. F(·) that enable us to
determine the proper limiting extreme-value distribution for $(Y_n^{(n)} - a_n)/b_n$. The
theorem does not tell us what the constants $\{a_n\}$ and $\{b_n\}$ should be. If,
however, the conditions for the third type are satisfied, then we have

$$n[1 - F(\beta_n x + \alpha_n)] \to e^{-x}, \quad \text{as } n \to \infty,$$

and

$$P\left[\frac{Y_n^{(n)} - a_n}{b_n} \le x\right] \to \exp(-e^{-x}).$$

Now, $P[(Y_n^{(n)} - a_n)/b_n \le x] = [F(b_n x + a_n)]^n$; hence

$$[F(b_n x + a_n)]^n \to \exp(-e^{-x}),$$

or

$$n \log F(b_n x + a_n) \to -e^{-x},$$

or, since log F is approximately −(1 − F) when F is close to 1,

$$n[1 - F(b_n x + a_n)] \to e^{-x};$$

and we see that $a_n$ can be taken equal to $\alpha_n$ and $b_n = \beta_n$. Thus, for the third
type the constants $\{a_n\}$ and $\{b_n\}$ are actually determined by the condition for that
type. We shall see below that for certain practical applications it is possible to
estimate $\{a_n\}$ and $\{b_n\}$.

Since the types $G_1(\cdot\,; \gamma)$ and $G_2(\cdot\,; \gamma)$ both contain a parameter, it can be
surmised that the third type $G_3(\cdot)$ is more convenient than the other two in
applications. Also, $G_3(y) = \exp(-e^{-y})$ is the correct limiting extreme-value
distribution for a number of families of distributions. We saw that it was
correct for the logistic and exponential distributions in Examples 8 and 9; it is
also correct for the gamma and normal distributions. What is often done in
practice is to assume that the sampled distribution F(·) is such that $\exp(-e^{-y})$
is the proper limiting extreme-value distribution; one can do this without
assuming exactly which parametric family the sampled distribution F(·) belongs to.
One then knows that $P[(Y_n^{(n)} - a_n)/b_n \le y] \to \exp(-e^{-y})$ for every y as $n \to \infty$.
Hence,

$$P\left[\frac{Y_n^{(n)} - a_n}{b_n} \le y\right] \approx \exp(-e^{-y})$$

for large fixed n. Or

$$P[Y_n^{(n)} \le a_n + b_n y] \approx \exp(-e^{-y}),$$

or

$$P[Y_n^{(n)} \le z] \approx \exp(-e^{-(z - a_n)/b_n}).$$

It is true that $a_n$ and $b_n$ are given in terms of the (1 − 1/n)th quantile and the
(1 − 1/ne)th quantile of the sampled distribution; however, for certain applications
they can be estimated, in which case we would have an approximate distribution
for $Y_n^{(n)}$, a distribution that is valid for a variety of different distributions
that could be sampled from. (One might note that in applications of the
central-limit theorem, which states that $\bar{X}_n$ is approximately distributed as $N(\mu, \sigma^2/n)$,
often $\mu$ and $\sigma^2$ are unknown and consequently they also have to be estimated.)
The preceding indicates how powerful the asymptotic extreme-value theory can
be. We have merely introduced the subject. For instance, we stated some
results for the asymptotic distribution of $Y_n^{(n)}$; one could state similar results
for $Y_1^{(n)}$, $Y_k^{(n)}$, or $Y_{n-k+1}^{(n)}$. The interested reader is referred to Refs. 13, 30,
and 35.


5.4 Sample Cumulative Distribution Function 

We have repeatedly stated in this chapter that our purpose in sampling from some
distribution was to make inferences about the sampled distribution, or population,
which was assumed to be at least partly unknown. One question that
might be posed is: Why not estimate the unknown distribution itself? The
answer is that we can estimate the unknown cumulative distribution function
using the sample, or empirical, cumulative distribution function, which is a
function of the order statistics.

Definition 13 Sample cumulative distribution function Let $X_1, X_2, \ldots,
X_n$ denote a random sample from a cumulative distribution function F(·),
and let $Y_1 \le Y_2 \le \cdots \le Y_n$ denote the corresponding order statistics.
The sample cumulative distribution function, denoted by $F_n(x)$, is defined by
$F_n(x) = (1/n) \times$ (number of $Y_j$ less than or equal to x) or, equivalently,
by $F_n(x) = (1/n) \times$ (number of $X_i$ less than or equal to x). ////


For fixed x, $F_n(x)$ is a statistic since it is a function of the sample. (The
dependence of $F_n(x)$ on the sample may not be clear from the notation itself.)
We shall see that $F_n(x)$ has the same distribution as that of the sample mean of a
Bernoulli distribution.


Theorem 17 Let $F_n(x)$ denote the sample cumulative distribution function
of a random sample of size n from F(·); then

$$P\left[F_n(x) = \frac{k}{n}\right] = \binom{n}{k} [F(x)]^k [1 - F(x)]^{n-k}, \quad k = 0, 1, \ldots, n. \tag{43}$$

proof Let $Z_i = I_{(-\infty, x]}(X_i)$; then $Z_i$ has a Bernoulli distribution
with parameter F(x). Hence, $\sum_{i=1}^{n} Z_i$, which is the number of $X_i$ less than
or equal to x, has a binomial distribution with parameters n and F(x). But
$F_n(x) = (1/n) \sum Z_i$. The result follows. ////

Much more could be said about the sample cumulative distribution function,
but we will wait until Chap. XI on nonparametric statistics to do so.

PROBLEMS 

1 (a) Give an example where the target population and the sampled population
are the same.

(b) Give an example where the target population and the sampled population
are not the same.

2 (a) A company manufactures transistors in three different plants A, B, and C
whose manufacturing methods are very similar. It is decided to inspect those
transistors that are manufactured in plant A since plant A is the largest plant
and statisticians are available there. In order to inspect a week’s production,
100 transistors will be selected at random and tested for defects. Define
the sampled population and target population.

(b) In part (a) above, it is decided to use the results in plant A to draw
conclusions about plants B and C. Define the target population.

3 (a) What is the probability that the two observations of a random sample of two 
from a population with a rectangular distribution over the unit interval will 
not differ by more than $\tfrac{1}{2}$? 

(b) What is the probability that the mean of a sample of two observations from a 
rectangular distribution over the unit interval will be between $\tfrac{1}{4}$ and $\tfrac{3}{4}$? 

4 (a) Balls are drawn with replacement from an urn containing one white and two 
black balls. Let $X = 0$ for a white ball and $X = 1$ for a black ball. For 
samples $X_1, X_2, \ldots, X_9$ of size 9, what is the joint distribution of the 
observations? The distribution of the sum of the observations? 

(b) Referring to part (a) above, find the expected values of the sample mean and 
sample variance. 

5 Let $X_1, \ldots, X_n$ be a random sample from a distribution which has a finite fourth 
moment. Define $\mu = \mathscr{E}[X_1]$, $\sigma^2 = \operatorname{var}[X_1]$, $\mu_3 = \mathscr{E}[(X_1 - \mu)^3]$, $\mu_4 = \mathscr{E}[(X_1 - \mu)^4]$, 
$\bar{X} = (1/n) \sum_{i=1}^{n} X_i$, and $S^2 = [1/(n - 1)] \sum_{i=1}^{n} (X_i - \bar{X})^2$. 

(a) Does $S^2 = [1/2n(n - 1)] \sum_{i=1}^{n} \sum_{j=1}^{n} (X_i - X_j)^2$? 

*(b) Find $\operatorname{var}[S^2]$. 

*(c) Find $\operatorname{cov}[\bar{X}, S^2]$, and note that $\operatorname{cov}[\bar{X}, S^2] = 0$ if $\mu_3 = 0$. 

Possible Hint: $\sum (X_i - \mu)^2 = \sum (X_i - \bar{X})^2 + (1/n) \sum_i \sum_j (X_i - \mu)(X_j - \mu)$. 

6 *(a) For a random sample of size 2 from a population with a finite (2r)th moment, 
find $\mathscr{E}[M_r]$ and $\operatorname{var}[M_r]$, where $M_r = (1/n) \sum_{i=1}^{n} (X_i - \bar{X}_n)^r$. 

(b) For a random sample of size $n$ from a population with mean $\mu$ and rth 
central moment $\mu_r$, show that 

$$\mathscr{E}\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^r\right] = \mu_r.$$


7 (a) Use the Chebyshev inequality to find how many times a coin must be tossed 
in order that the probability will be at least .90 that $\bar{X}$ will lie between .4 
and .6. (Assume that the coin is true.) 

(b) How could one determine the number of tosses required in part (a) more 
accurately, i.e., make the probability very nearly equal to .90? What is the 
number of tosses? 

8 If a population has $\sigma = 2$ and $\bar{X}$ is the mean of samples of size 100, find limits 
between which $\bar{X} - \mu$ will lie with probability .90. Use both the Chebyshev 
inequality and the central-limit theorem. Why do the two results differ? 

9 Suppose that $\bar{X}_1$ and $\bar{X}_2$ are means of two samples of size $n$ from a population 
with variance $\sigma^2$. Determine $n$ so that the probability will be about .01 that the 
two sample means will differ by more than $\sigma$. (Consider $Y = \bar{X}_1 - \bar{X}_2$.) 

10 Suppose that light bulbs made by a standard process have an average life of 2000 
hours with a standard deviation of 250 hours, and suppose that it is considered 
worthwhile to replace the process if the mean life can be increased by at least 
10 percent. An engineer wishes to test a proposed new process, and he is willing 
to assume that the standard deviation of the distribution of lives is about the 
same as for the standard process. How large a sample should he examine if he 
wishes the probability to be about .01 that he will fail to adopt the new process if 
in fact it produces bulbs with a mean life of 2250 hours? 

11 A research worker wishes to estimate the mean of a population using a sample 
large enough that the probability will be .95 that the sample mean will not differ 
from the population mean by more than 25 percent of the standard deviation. 
How large a sample should he take ? 

12 A polling agency wishes to take a sample of voters in a given state large enough 
that the probability is only .01 that they will find the proportion favoring a certain 
candidate to be less than 50 percent when in fact it is 52 percent. How large a 
sample should be taken? 

13 A standard drug is known to be effective in about 80 percent of the cases in which 
it is used to treat infections. A new drug has been found effective in 85 of the 
first 100 cases tried. Is the superiority of the new drug well established? (If 
the new drug were only as effective as the old, what would be the probability 
of obtaining 85 or more successes in a sample of 100?) 

14 Find the third moment about the mean of the sample mean for samples of size n 
from a Bernoulli population. Show that it approaches 0 as n becomes large 
(as it must if the normal approximation is to be valid). 

15 (a) A bowl contains five chips numbered from 1 to 5. A sample of two drawn 
without replacement from this finite population is said to be random if all 
possible pairs of the five chips have an equal chance to be drawn. What is 
the expected value of the sample mean? What is the variance of the sample 
mean? 

(b) Suppose that the two chips of part (a) were drawn with replacement; what 
would be the variance of the sample mean? Why might one guess that this 
variance would be larger than the one obtained before? 

*(c) Generalize part (a) by considering $N$ chips and samples of size $n$. Show that 
the variance of the sample mean is 

$$\frac{\sigma^2}{n} \, \frac{N - n}{N - 1},$$

where $\sigma^2$ is the population variance; that is, 

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left(i - \frac{N + 1}{2}\right)^2.$$

16 If $X_1, X_2, X_3$ are independent random variables and each has a uniform 
distribution over (0, 1), derive the distribution of $(X_1 + X_2)/2$ and $(X_1 + X_2 + X_3)/3$. 

17 If $X_1, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$, find the mean and variance of 

$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}.$$


18 On the F distribution: 

(a) Derive the variance of the F distribution. [See part (d).] 

(b) If $X$ has an F distribution with $m$ and $n$ degrees of freedom, argue that $1/X$ 
has an F distribution with $n$ and $m$ degrees of freedom. 

(c) If $X$ has an F distribution with $m$ and $n$ degrees of freedom, show that 

$$W = \frac{mX/n}{1 + mX/n}$$

has a beta distribution. 

(d) Use the result of part (c) and the beta function to find the mean and variance 
of the F distribution. [Find the first two moments of $mX/n = W/(1 - W)$.] 

19 On the t distribution: 

(a) Find the mean and variance of Student’s t distribution. (Be careful about 
existence.) 

(b) Show that the density of a t-distributed random variable approaches the 
standard normal density as the degrees of freedom increase. (Assume that 
the “constant” part of the density does what it has to do.) 

(c) If $X$ is t-distributed, show that $X^2$ is F-distributed. 

(d) If $X$ is t-distributed with $k$ degrees of freedom, show that $1/(1 + X^2/k)$ has 
a beta distribution. 

20 Let $X_1, X_2$ be a random sample from $N(0, 1)$. Using the results of Sec. 4 of 
Chap. VI, answer the following: 

(a) What is the distribution of $(X_2 - X_1)/\sqrt{2}$? 

(b) What is the distribution of $(X_1 + X_2)^2/(X_2 - X_1)^2$? 

(c) What is the distribution of $(X_2 + X_1)/\sqrt{(X_1 - X_2)^2}$? 

(d) What is the distribution of $1/Z$ if $Z = X_1^2/X_2^2$? 

21 Let $X_1, \ldots, X_n$ be a random sample from $N(0, 1)$. Define 

$$\bar{X}_k = \frac{1}{k} \sum_{i=1}^{k} X_i \qquad \text{and} \qquad \bar{X}_{n-k} = \frac{1}{n - k} \sum_{i=k+1}^{n} X_i.$$

Using the results of Sec. 4, answer the following: 

(a) What is the distribution of $\tfrac{1}{2}(\bar{X}_k + \bar{X}_{n-k})$? 

(b) What is the distribution of $k\bar{X}_k^2 + (n - k)\bar{X}_{n-k}^2$? 

(c) What is the distribution of $X_1^2/X_2^2$? 

(d) What is the distribution of $X_1/X_n$? 

22 Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Define 

$$\bar{X}_k = \frac{1}{k} \sum_{i=1}^{k} X_i, \qquad \bar{X}_{n-k} = \frac{1}{n - k} \sum_{i=k+1}^{n} X_i, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$

$$S_k^2 = \frac{1}{k - 1} \sum_{i=1}^{k} (X_i - \bar{X}_k)^2, \qquad S_{n-k}^2 = \frac{1}{n - k - 1} \sum_{i=k+1}^{n} (X_i - \bar{X}_{n-k})^2,$$

and 

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Using the results of Sec. 4, answer the following: 

(a) What is the distribution of $\sigma^{-2}[(k - 1)S_k^2 + (n - k - 1)S_{n-k}^2]$? 

(b) What is the distribution of $(\tfrac{1}{2})(\bar{X}_k + \bar{X}_{n-k})$? 

(c) What is the distribution of $\sigma^{-2}(X_i - \mu)^2$? 

(d) What is the distribution of $S_k^2/S_{n-k}^2$? 

(e) What is the distribution of $(\bar{X} - \mu)/(S/\sqrt{n})$? 

23 Let $Z_1, Z_2$ be a random sample of size 2 from $N(0, 1)$ and $X_1, X_2$ a random sample 
of size 2 from $N(1, 1)$. Suppose the $Z_i$'s are independent of the $X_j$'s. Use the 
results of Sec. 4 to answer the following: 

(a) What is the distribution of $\bar{X} + \bar{Z}$? 

(b) What is the distribution of $(Z_1 + Z_2)/\sqrt{[(X_2 - X_1)^2 + (Z_2 - Z_1)^2]/2}$? 

(c) What is the distribution of $[(X_1 - X_2)^2 + (Z_1 - Z_2)^2 + (Z_1 + Z_2)^2]/2$? 

(d) What is the distribution of $(X_2 + X_1 - 2)^2/(X_2 - X_1)^2$? 

24 Let $X_i$ be a random variable distributed $N(i, i^2)$, $i = 1, 2, 3$. Assume that the 
random variables $X_1$, $X_2$, and $X_3$ are independent. Using only the three random 
variables $X_1$, $X_2$, and $X_3$: 

(a) Give an example of a statistic that has a chi-square distribution with three 
degrees of freedom. 

(b) Give an example of a statistic that has an F distribution with one and two 
degrees of freedom. 

(c) Give an example of a statistic that has a t distribution with two degrees of 
freedom. 

25 Let $X_1, X_2$ be a random sample of size 2 from the density 

$$f(x) = \tfrac{1}{2} e^{-x/2} I_{(0, \infty)}(x).$$

Use results on the chi-square and F distributions to give the distribution of 
$X_1/X_2$. 

26 Let $U_1, U_2$ be a random sample of size 2 from a uniform distribution over the 
interval (0, 1). Let $Y_1$ and $Y_2$ be the corresponding order statistics. 

(a) For $0 < y_2 < 1$, what is $f_{Y_1 | Y_2 = y_2}(y_1 | y_2)$, the conditional density of $Y_1$ given 
$Y_2 = y_2$? 

(b) What is the distribution of $Y_2 - Y_1$? 

27 If $X_1, X_2, \ldots, X_n$ are independently and normally distributed with the same 
mean but different variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ and assuming that $U = \sum (X_i/\sigma_i^2) / \sum (1/\sigma_i^2)$ 
and $V = \sum (X_i - U)^2/\sigma_i^2$ are independently distributed, show that $U$ is normal and 
$V$ has the chi-square distribution with $n - 1$ degrees of freedom. 

28 For three samples from normal populations (with variances $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$), the 
sample sizes being $n_1$, $n_2$, and $n_3$, find the joint density of 


S? 

t/= si and 



where $S_1^2$, $S_2^2$, and $S_3^2$ are the sample variances. (Assume that the samples 
are independent.) 

29 Let a sample of size $n_1$ from a normal population (with variance $\sigma_1^2$) have sample 
variance $S_1^2$, and let a second sample of size $n_2$ from a second normal population 
(with mean $\mu_2$ and variance $\sigma_2^2$) have mean $\bar{X}$ and sample variance $S_2^2$. Find the 
joint density of 


U 


VnAX-n*) 


and 



(Assume that the samples are independent) 

30 For a random sample of size 2 from a normal density with mean 0 and variance 1, 
find the distribution of the range. 

31 (a) What is the probability that the larger of two random observations from any 

continuous distribution will exceed the median? 

(b) Generalize the result of part (a) to samples of size n. 

32 Considering random samples of size $n$ from a population with density $f(x)$, what 
is the expected value of the area under $f(x)$ to the left of the smallest sample 
observation? 

*33 Consider a random sample $X_1, \ldots, X_n$ from the uniform distribution over the 
interval $(\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma)$. Let $Y_1 \le \cdots \le Y_n$ denote the corresponding 
order statistics. 

(a) Find the mean and variance of $Y_n - Y_1$. 

(b) Find the mean and variance of $(Y_1 + Y_n)/2$. 

(c) Find the mean and variance of $Y_{k+1}$ if $n = 2k + 1$, $k = 0, 1, \ldots$. 

(d) Compare the variances of $\bar{X}_n$, $Y_{k+1}$, and $(Y_1 + Y_n)/2$. 

Hint: It might be easier to solve the problem for $U_1, \ldots, U_n$, a random sample 
from the uniform distribution over either (0, 1) or (−1, 1), and then make an 
appropriate transformation. 

34 Let $X_1, \ldots, X_n$ be a random sample from the density 

$$f(x; \alpha, \beta) = \frac{1}{2\beta} \exp\left[-\frac{|x - \alpha|}{\beta}\right],$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. Compare the asymptotic distributions of the 
sample mean and the sample median. In particular, compare the asymptotic 
variances. 

*35 Let $X_1, \ldots, X_n$ be a random sample from the cumulative distribution function 
$F(x) = \{1 - \exp[-x/(1 - x)]\} I_{(0, 1)}(x) + I_{[1, \infty)}(x)$. What is the limiting 
distribution of $(Y_n^{(n)} - a_n)/b_n$, where $a_n = \log n/(1 + \log n)$ and $b_n^{-1} = (\log n)(1 + \log n)$? 
What is the asymptotic distribution of $Y_n^{(n)}$? 

36 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$, $\theta > 0$. 

(a) Compare the asymptotic distribution of $\bar{X}_n$ with the asymptotic distribution 
of the sample median. 

(b) For your choice of $\{a_n\}$ and $\{b_n\}$, find a limiting distribution of $(Y_n^{(n)} - a_n)/b_n$. 

(c) For your choice of $\{a_n\}$ and $\{b_n\}$, find a limiting distribution of $(Y_1^{(n)} - a_n)/b_n$. 


PARAMETRIC POINT ESTIMATION 


1 INTRODUCTION AND SUMMARY 

Chapter VI commenced with some general comments about inference. There, 
it was indicated that a sample from the distribution of a population is useful 
in making inferences about the population. Two important problems in 
statistical inference are estimation and tests of hypotheses. One type of 
estimation, namely point estimation, is to be the subject of this chapter. 

The problem of estimation, as it shall be considered herein, is loosely 
defined as follows: Assume that some characteristic of the elements in a 
population can be represented by a random variable $X$ whose density is 
$f_X(\cdot\,; \theta) = f(\cdot\,; \theta)$, where the form of the density is assumed known except that it 
contains an unknown parameter $\theta$ (if $\theta$ were known, the density function would be 
completely specified, and there would be no need to make inferences about it). 
Further assume that the values $x_1, x_2, \ldots, x_n$ of a random sample $X_1, X_2, \ldots, X_n$ 
from $f(\cdot\,; \theta)$ can be observed. On the basis of the observed sample values 
$x_1, x_2, \ldots, x_n$ it is desired to estimate the value of the unknown parameter $\theta$ or 
the value of some function, say $\tau(\theta)$, of the unknown parameter. This 
estimation can be made in two ways. The first, called point estimation, is to let the 
value of some statistic, say $\mathscr{t}(X_1, \ldots, X_n)$, represent, or estimate, the unknown 
$\tau(\theta)$; such a statistic $\mathscr{t}(X_1, \ldots, X_n)$ is called a point estimator. The second, 
called interval estimation, is to define two statistics, say $\mathscr{t}_1(X_1, \ldots, X_n)$ and 
$\mathscr{t}_2(X_1, \ldots, X_n)$, where $\mathscr{t}_1(X_1, \ldots, X_n) < \mathscr{t}_2(X_1, \ldots, X_n)$, so that 
$(\mathscr{t}_1(X_1, \ldots, X_n), \mathscr{t}_2(X_1, \ldots, X_n))$ constitutes an interval for which the probability 
can be determined that it contains the unknown $\tau(\theta)$. For example, if $f(\cdot\,; \theta)$ is the 
normal density, that is, 


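$$f(x; \theta) = f(x; \mu, \sigma) = \phi_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right],$$

where the parameter $\theta$ is $(\mu, \sigma)$, and if it is desired to estimate the mean, that is, 
$\tau(\theta) = \mu$, then the statistic $\bar{X} = (1/n) \sum_{i=1}^{n} X_i$ is a possible point estimator of 
$\tau(\theta) = \mu$, and $(\bar{X} - 2\sqrt{S^2/n}, \bar{X} + 2\sqrt{S^2/n})$ is a possible interval estimator 
of $\tau(\theta) = \mu$. {Recall that $S^2 = [1/(n - 1)] \sum_{i=1}^{n} (X_i - \bar{X})^2$.} Point estimation 
will be discussed in this chapter and interval estimation in the next. 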
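
As a small numerical illustration of the two kinds of estimation of a mean, the sketch below computes the point estimate $\bar{x}$ and the interval $(\bar{x} - 2\sqrt{s^2/n}, \bar{x} + 2\sqrt{s^2/n})$ for one set of observations. The data values and the function name `point_and_interval` are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def point_and_interval(xs):
    """Return the sample mean and the interval xbar -/+ 2*sqrt(s^2/n)."""
    n = len(xs)
    xbar = mean(xs)
    s2 = variance(xs)  # sample variance with divisor n - 1, i.e. S^2
    half_width = 2 * sqrt(s2 / n)
    return xbar, (xbar - half_width, xbar + half_width)


# Hypothetical observed sample values x_1, ..., x_5.
data = [4.0, 6.0, 5.0, 7.0, 3.0]
xbar, (lower, upper) = point_and_interval(data)
print(xbar, lower, upper)
```

The point estimate is a single number, while the interval estimate is a pair of numbers computed from the same sample.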

Point estimation admits two problems: the first, to devise some means 
of obtaining a statistic to use as an estimator; the second, to select criteria and 
techniques to define and find a “best” estimator among many possible 
estimators. Several methods of finding point estimators are introduced in Sec. 2. 
One of these, and probably the most important, is the method of maximum 
likelihood. In Sec. 3 several “optimum” properties that an estimator or sequence 
of estimators may possess are defined. These include closeness, bias and variance, 
efficiency, and consistency. The loss and risk functions, essential elements 
in decision theory, are defined as possible tools in assessing the goodness of 
estimators. 

Section 4 is devoted to sufficiency, an important and useful concept in the 
study of mathematical statistics that will also be utilized in succeeding chapters. 
Unbiased estimation is considered in Sec. 5. The Cramér-Rao lower bound 
for the variance of unbiased estimators is given, as well as the Rao-Blackwell 
theorem concerning sufficient statistics. A brief look at invariant estimators 
is presented in Sec. 6. Bayes estimation is considered in Sec. 7. A Bayes 
estimator is given as the mean of the posterior or from the decision-theoretical 
viewpoint as an estimator having smallest average risk. Some results in the 
simultaneous estimation of several parameters are given in Sec. 8. Included is 
the notion of ellipsoid of concentration of a vector of point estimators and the 
Lehmann-Scheffé theorem. Section 9 is devoted to a brief discussion of some 
optimum properties of maximum-likelihood estimators. 

Frequent use of some of the distribution-theoretical results for statistics, 
which were derived in earlier chapters, especially Chaps. V and VI, will be 
noted throughout this chapter. After all, estimators are statistics, and to study 
properties of estimators, it is desirable to look at their distributions. 


2 METHODS OF FINDING ESTIMATORS 

Assume that $X_1, \ldots, X_n$ is a random sample from a density $f(\cdot\,; \theta)$, where the form 
of the density is known but the parameter $\theta$ is unknown. Further assume that 
$\theta$ is a vector of real numbers, say $\theta = (\theta_1, \ldots, \theta_k)$. (Often $k$ will be unity.) 
We sometimes say that $\theta_1, \ldots, \theta_k$ are $k$ parameters. We will let $\bar{\Theta}$, called the 
parameter space, denote the set of possible values that the parameter $\theta$ can 
assume. The object is to find statistics, functions of the observations $X_1, \ldots, X_n$, 
to be used as estimators of the $\theta_j$, $j = 1, \ldots, k$. Or, more generally, our object 
is to find statistics to be used as estimators of certain functions, say $\tau_1(\theta), \ldots, \tau_r(\theta)$, 
of $\theta = (\theta_1, \ldots, \theta_k)$. A variety of methods of finding such estimators has been 
proposed on more or less intuitive grounds. Several such methods will be 
presented, along with examples, in this section. Another method, the 
method of least squares, will be discussed in Chap. X. 

An estimator can be defined as in Definition 1. 

Definition 1 Estimator Any statistic (known function of observable 
random variables that is itself a random variable) whose values are used 
to estimate $\tau(\theta)$, where $\tau(\cdot)$ is some function of the parameter $\theta$, is defined 
to be an estimator of $\tau(\theta)$. //// 

An estimator is always a statistic which is both a random variable and a 
function. For instance, suppose $X_1, \ldots, X_n$ is a random sample from a density 
$f(\cdot\,; \theta)$ and it is desired to estimate $\tau(\theta)$, where $\tau(\cdot)$ is some function of $\theta$. Let 
$\mathscr{t}(X_1, \ldots, X_n)$ be an estimator of $\tau(\theta)$. The estimator $\mathscr{t}(X_1, \ldots, X_n)$ can be 
thought of in two related ways: first, as the random variable, say $T$, where 
$T = \mathscr{t}(X_1, \ldots, X_n)$, and, second, as the function $\mathscr{t}(\cdot, \ldots, \cdot)$. Naturally, one 
needs to specify the function $\mathscr{t}(\cdot, \ldots, \cdot)$ before the random variable 
$T = \mathscr{t}(X_1, \ldots, X_n)$ is defined. In all we have three types of tees: the capital Latin $T$, 
which represents the random variable $\mathscr{t}(X_1, \ldots, X_n)$, the small script $\mathscr{t}$, which 
represents the function $\mathscr{t}(\cdot, \ldots, \cdot)$, and the small Latin $t$, which represents a 
value of $T$; that is, $t = \mathscr{t}(x_1, \ldots, x_n)$. Let us adopt the convention of calling 
the statistic (or random variable) that is used as an estimator an “estimator” 
and calling a value that the statistic takes on an “estimate.” Thus the word 
“estimator” stands for the function, and the word “estimate” stands for a 
value of that function; for example, $\bar{X}_n = (1/n) \sum_{i=1}^{n} X_i$ is an estimator of a mean $\mu$, 
and $\bar{x}_n$ is an estimate of $\mu$. Here $T$ is $\bar{X}_n$, $t$ is $\bar{x}_n$, and $\mathscr{t}(\cdot, \ldots, \cdot)$ is the function 
defined by summing the arguments and then dividing by $n$. 
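
The distinction between the function, the random variable, and a value of it can be mirrored in a few lines of code. This is only an illustrative sketch: the function name `t`, the observed values, and the simulated normal population with mean 10 are assumptions made for the example.

```python
import random


def t(*xs):
    """The function t(., ..., .): sum the arguments and divide by n."""
    return sum(xs) / len(xs)


# An estimate: the value t(x_1, ..., x_n) for one observed sample.
observed = (2.0, 4.0, 9.0)
estimate = t(*observed)

# The estimator viewed as a random variable T = t(X_1, ..., X_n):
# every fresh random sample yields a different value of T.
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [t(*(rng.gauss(10.0, 1.0) for _ in range(3))) for _ in range(5)]

print(estimate)
print(draws)
```

Here `t` plays the role of the small script tee, each entry of `draws` is one realization of the capital $T$, and `estimate` is a small Latin $t$.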

Notation in estimation that has widespread usage is the following: $\hat{\theta}$ is 
used to denote an estimate of $\theta$, and, more generally, $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$ is a vector 
that estimates the vector $(\theta_1, \ldots, \theta_k)$, where $\hat{\theta}_j$ estimates $\theta_j$, $j = 1, \ldots, k$. If 
$\hat{\theta}$ is an estimate of $\theta$, then $\hat{\Theta}$ is the corresponding estimator of $\theta$; and if the 
discussion requires that the function that defines both $\hat{\theta}$ and $\hat{\Theta}$ be specified, then 
it can be denoted by a small script theta, that is, $\hat{\Theta} = \hat{\vartheta}(X_1, \ldots, X_n)$. 

When we speak of estimating $\theta$, we are speaking of estimating the fixed 
yet unknown value that $\theta$ has. That is, we assume that the random sample 
$X_1, \ldots, X_n$ came from the density $f(\cdot\,; \theta)$, where $\theta$ is unknown but fixed. Our 
object is, after looking at the values of the random sample, to estimate the fixed 
unknown $\theta$. And when we speak of estimating $\tau(\theta)$, we are speaking of 
estimating the value $\tau(\theta)$ that the known function $\tau(\cdot)$ assumes for the unknown but 
fixed $\theta$. 


2.1 Methods of Moments 

Let $f(\cdot\,; \theta_1, \ldots, \theta_k)$ be a density of a random variable $X$ which has $k$ parameters 
$\theta_1, \ldots, \theta_k$. As before let $\mu'_r$ denote the rth moment about 0; that is, $\mu'_r = \mathscr{E}[X^r]$. 
In general $\mu'_r$ will be a known function of the $k$ parameters $\theta_1, \ldots, \theta_k$. Denote 
this by writing $\mu'_r = \mu'_r(\theta_1, \ldots, \theta_k)$. Let $X_1, \ldots, X_n$ be a random sample 
from the density $f(\cdot\,; \theta_1, \ldots, \theta_k)$, and, as before, let $M'_j$ be the jth sample 
moment; that is, 

$$M'_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j.$$

Form the $k$ equations 

$$M'_j = \mu'_j(\theta_1, \ldots, \theta_k), \qquad j = 1, \ldots, k, \tag{1}$$

in the $k$ variables $\theta_1, \ldots, \theta_k$, and let $\hat{\Theta}_1, \ldots, \hat{\Theta}_k$ be their solution (we assume 
that there is a unique solution). We say that the estimator $(\hat{\Theta}_1, \ldots, \hat{\Theta}_k)$, 
where $\hat{\Theta}_j$ estimates $\theta_j$, is the estimator of $(\theta_1, \ldots, \theta_k)$ obtained by the method of 
moments. The estimators were obtained by replacing population moments by 
sample moments. Some examples follow. 

EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution 
with mean $\mu$ and variance $\sigma^2$. Let $(\theta_1, \theta_2) = (\mu, \sigma)$. Estimate the 
parameters $\mu$ and $\sigma$ by the method of moments. Recall that $\sigma^2 = \mu'_2 - (\mu'_1)^2$ 
and $\mu = \mu'_1$. The method-of-moments equations become 

$$M'_1 = \mu'_1 = \mu'_1(\mu, \sigma) = \mu$$

$$M'_2 = \mu'_2 = \mu'_2(\mu, \sigma) = \sigma^2 + \mu^2,$$

and their solution is the following: The method-of-moments estimator 
of $\mu$ is $M'_1 = \bar{X}$, and the method-of-moments estimator of $\sigma$ is 
$\sqrt{M'_2 - \bar{X}^2} = \sqrt{(1/n) \sum X_i^2 - \bar{X}^2} = \sqrt{\sum (X_i - \bar{X})^2/n}$. Note that the 
method-of-moments estimator of $\sigma$ given above is not $\sqrt{S^2}$. //// 
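
The closing remark of Example 1 can be seen numerically. The sketch below, with made-up data and hypothetical function names, computes the method-of-moments estimates of $\mu$ and $\sigma$ from the first two sample moments and compares the latter with $\sqrt{S^2}$.

```python
from math import sqrt


def normal_method_of_moments(xs):
    """Solve M'_1 = mu and M'_2 = sigma^2 + mu^2 for (mu, sigma)."""
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment M'_1
    m2 = sum(x * x for x in xs) / n  # second sample moment M'_2
    # Guard against a tiny negative value caused by floating-point rounding.
    return m1, sqrt(max(m2 - m1 * m1, 0.0))


def sample_sd(xs):
    """Square root of S^2, which uses the divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


# Hypothetical observed values.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, sigma_hat = normal_method_of_moments(data)
print(mu_hat, sigma_hat, sample_sd(data))
```

The two estimates of $\sigma$ differ only by the factor $\sqrt{(n - 1)/n}$, so they are close for large $n$ but not equal.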


EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from a Poisson distribution 
with parameter $\lambda$. Estimate $\lambda$. There is only one parameter, hence 
only one equation, which is 

$$M'_1 = \mu'_1 = \mu'_1(\lambda) = \lambda.$$

Hence the method-of-moments estimator of $\lambda$ is $M'_1 = \bar{X}$, which says 
estimate the population mean $\lambda$ with the sample mean $\bar{x}$. //// 


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from the negative 
exponential density $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$. Estimate $\theta$. The 
method-of-moments equation is 

$$M'_1 = \mu'_1 = \mu'_1(\theta) = \frac{1}{\theta};$$

hence the method-of-moments estimator of $\theta$ is $1/M'_1 = 1/\bar{X}$. //// 
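
Both one-parameter examples reduce to a single line of arithmetic on the first sample moment. The following sketch uses invented data and hypothetical function names purely to show the two solutions side by side.

```python
def poisson_method_of_moments(xs):
    """Solve M'_1 = lambda: the estimate is the sample mean."""
    return sum(xs) / len(xs)


def exponential_method_of_moments(xs):
    """Solve M'_1 = 1/theta: the estimate is the reciprocal of the sample mean."""
    return len(xs) / sum(xs)


# Hypothetical observations: counts for the Poisson case,
# positive waiting times for the negative exponential case.
counts = [2, 0, 3, 1, 4]
waits = [0.5, 1.5, 1.0]

lambda_hat = poisson_method_of_moments(counts)
theta_hat = exponential_method_of_moments(waits)
print(lambda_hat, theta_hat)
```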


EXAMPLE 4 Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution 
on $(\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma)$. Here the unknown parameters are two, 
namely $\mu$ and $\sigma$, which are the population mean and standard deviation. 
The method-of-moments equations are 

$$M'_1 = \mu'_1 = \mu'_1(\mu, \sigma) = \mu$$

and 

$$M'_2 = \mu'_2 = \mu'_2(\mu, \sigma) = \sigma^2 + \mu^2;$$

hence the method-of-moments estimators are $\bar{X}$ for $\mu$ and 

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

for $\sigma$. 

We shall see later that there are better estimators of $\mu$ and $\sigma$ for this 
distribution. //// 

Method-of-moments estimators are not uniquely defined. The method- 
of-moments equations given in Eq. (1) are obtained by using the first k raw 
moments. Central moments (rather than raw moments) could also be used to 
obtain equations whose solution would also produce estimators that would be 
labeled method-of-moments estimators. Also, moments other than the first 
k could be used to obtain estimators that would be labeled method-of-moments 
estimators. 

If, instead of estimating $(\theta_1, \ldots, \theta_k)$, method-of-moments estimators of, 
say, $\tau_1(\theta_1, \ldots, \theta_k), \ldots, \tau_r(\theta_1, \ldots, \theta_k)$ are desired, they can be obtained in several 
ways. One way would be to first find method-of-moments estimates, say 
$\hat{\theta}_1, \ldots, \hat{\theta}_k$, of $\theta_1, \ldots, \theta_k$ and then use $\tau_j(\hat{\theta}_1, \ldots, \hat{\theta}_k)$ as an estimate of $\tau_j(\theta_1, \ldots, \theta_k)$ 
for $j = 1, \ldots, r$. Another way would be to form the equations 

$$M'_j = \mu'_j(\tau_1, \ldots, \tau_r), \qquad j = 1, \ldots, r$$

and solve them for $\tau_1, \ldots, \tau_r$. Estimators obtained using either way are called 
method-of-moments estimators and may not be the same in both cases. 


2.2 Maximum Likelihood 

To introduce the method of maximum likelihood, consider a very simple 
estimation problem. Suppose that an urn contains a number of black and a number 
of white balls, and suppose that it is known that the ratio of the numbers is 
3/1 but that it is not known whether the black or the white balls are more 
numerous. That is, the probability of drawing a black ball is either $\tfrac{1}{4}$ or $\tfrac{3}{4}$. 
If $n$ balls are drawn with replacement from the urn, the distribution of $X$, the 
number of black balls, is given by the binomial distribution 

$$f(x; p) = \binom{n}{x} p^x q^{n-x} \qquad \text{for } x = 0, 1, 2, \ldots, n,$$

where $q = 1 - p$ and $p$ is the probability of drawing a black ball. Here $p = \tfrac{1}{4}$ 
or $p = \tfrac{3}{4}$. 

We shall draw a sample of three balls, that is, $n = 3$, with replacement and 
attempt to estimate the unknown parameter $p$ of the distribution. The 
estimation problem is particularly simple in this case because we have only to choose 
between the two numbers .25 and .75. Let us anticipate the results of the 
drawing of the sample. The possible outcomes and their probabilities are given 
below: 


Outcome: x          0        1        2        3

f(x; 3/4)         1/64     9/64    27/64    27/64

f(x; 1/4)        27/64    27/64     9/64     1/64


In the present example, if we found $x = 0$ in a sample of 3, the estimate .25 for 
$p$ would be preferred over .75 because the probability $\tfrac{27}{64}$ is greater than $\tfrac{1}{64}$, 
i.e., because a sample with $x = 0$ is more likely (in the sense of having larger 
probability) to arise from a population with $p = \tfrac{1}{4}$ than from one with $p = \tfrac{3}{4}$. 
And in general we should estimate $p$ by .25 when $x = 0$ or 1 and by .75 when 
$x = 2$ or 3. The estimator may be defined as 

$$\hat{p} = \hat{p}(x) = \begin{cases} .25 & \text{for } x = 0, 1 \\ .75 & \text{for } x = 2, 3. \end{cases}$$

The estimator thus selects for every possible $x$ the value of $p$, say $\hat{p}$, such that 

$$f(x; \hat{p}) > f(x; p'),$$

where $p'$ is the alternative value of $p$. 
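
The table of outcome probabilities and the resulting rule for choosing between the two values of $p$ can be reproduced with exact fractions. The names `binom_pmf`, `table`, and `p_hat` in this sketch are chosen for the illustration only.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability f(x; p) for n draws with replacement."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


candidates = [Fraction(1, 4), Fraction(3, 4)]  # the two possible values of p
n = 3

# Probability of each outcome x = 0, 1, 2, 3 under each candidate p.
table = {p: [binom_pmf(x, n, p) for x in range(n + 1)] for p in candidates}


def p_hat(x):
    """Pick the candidate p under which the observed x is most probable."""
    return max(candidates, key=lambda p: binom_pmf(x, n, p))


for p in candidates:
    print(p, [str(v) for v in table[p]])
print([str(p_hat(x)) for x in range(n + 1)])
```

Because only two values of $p$ are admitted, maximizing the likelihood here is a comparison of two numbers for each observed $x$.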

More generally, if several alternative values of $p$ were possible, we might 
reasonably proceed in the same manner. Thus if we found $x = 6$ in a sample 
of 25 from a binomial population, we should substitute all possible values of $p$ 
in the expression 

$$f(6; p) = \binom{25}{6} p^6 (1 - p)^{19} \qquad \text{for } 0 \le p \le 1 \tag{2}$$

and choose as our estimate that value of $p$ which maximized $f(6; p)$. For the 
given possible values of $p$ we should find our estimate to be $\tfrac{6}{25}$. The position 
of its maximum value can be found by putting the derivative of the function 
defined in Eq. (2) with respect to $p$ equal to 0 and solving the resulting equation 
for $p$. Thus, 

$$\frac{d}{dp} f(6; p) = \binom{25}{6} p^5 (1 - p)^{18} [6(1 - p) - 19p],$$

and on putting this equal to 0 and solving for $p$, we find that $p = 0, 1, \tfrac{6}{25}$ are 
the roots. The first two roots give a minimum, and so our estimate is therefore 
$\hat{p} = \tfrac{6}{25}$. This estimate has the property that 

$$f(6; \hat{p}) > f(6; p'),$$

where $p'$ is any other value of $p$ in the interval $0 \le p \le 1$. 
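
The maximization for $x = 6$ in a sample of 25 can be checked without calculus by evaluating the likelihood on a fine grid. The grid spacing of .001 and the function names below are assumptions made for this sketch; the grid happens to contain $\tfrac{6}{25} = .240$ exactly.

```python
from fractions import Fraction
from math import comb


def likelihood(p, x=6, n=25):
    """f(x; p) for the binomial sample, viewed as a function of p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def derivative_factor(p, x=6, n=25):
    """The bracketed factor x(1 - p) - (n - x)p of the derivative."""
    return x * (1 - p) - (n - x) * p


p_hat = Fraction(6, 25)

# Search a grid of exact rationals 0, .001, ..., 1 for the largest likelihood.
grid = [Fraction(k, 1000) for k in range(1001)]
best = max(grid, key=likelihood)

print(best, derivative_factor(p_hat))
```

The bracketed factor vanishes at $\tfrac{6}{25}$, and the grid search returns the same point, while the endpoints 0 and 1 give likelihood 0.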

In order to define maximum-likelihood estimators, we shall first define the 
likelihood function. 

Definition 2 Likelihood function The likelihood function of $n$ random 
variables $X_1, X_2, \ldots, X_n$ is defined to be the joint density of the $n$ random 
variables, say $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, which is considered to be a function 
of $\theta$. In particular, if $X_1, \ldots, X_n$ is a random sample from the density 
$f(x; \theta)$, then the likelihood function is $f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta)$. //// 

Notation To remind ourselves to think of the likelihood function as a 
function of $\theta$, we shall use the notation $L(\theta; x_1, \ldots, x_n)$ or $L(\cdot\,; x_1, \ldots, x_n)$ 
for the likelihood function. //// 

The likelihood function $L(\theta; x_1, \ldots, x_n)$ gives the likelihood that the 
random variables assume a particular value $x_1, x_2, \ldots, x_n$. The likelihood is 
the value of a density function; so for discrete random variables it is a 
probability. Suppose for a moment that $\theta$ is known; denote the value by $\theta_0$. The 
particular value of the random variables which is “most likely to occur” is that 
value $x'_1, x'_2, \ldots, x'_n$ such that $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta_0)$ is a maximum. For 
example, for simplicity let us assume that $n = 1$ and $X_1$ has the normal density 
with mean 6 and variance 1. Then the value of the random variable which is 
most likely to occur is $X_1 = 6$. By “most likely to occur” we mean the value 
$x'_1$ of $X_1$ such that $\phi_{6,1}(x'_1) > \phi_{6,1}(x_1)$. Now let us suppose that the joint 
density of $n$ random variables is $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, where $\theta$ is unknown. 
Let the particular values which are observed be represented by $x'_1, x'_2, \ldots, x'_n$. 
We want to know from which density is this particular set of values most likely 
to have come. We want to know from which density (what value of $\theta$) is the 
likelihood largest that the set $x'_1, \ldots, x'_n$ was obtained. In other words, we 
want to find the value of $\theta$ in $\bar{\Theta}$, denoted by $\hat{\theta}$, which maximizes the likelihood 
function $L(\theta; x'_1, \ldots, x'_n)$. The value $\hat{\theta}$ which maximizes the likelihood function 
is, in general, a function of $x_1, \ldots, x_n$, say $\hat{\theta} = \hat{\vartheta}(x_1, x_2, \ldots, x_n)$. When this is 
the case, the random variable $\hat{\Theta} = \hat{\vartheta}(X_1, X_2, \ldots, X_n)$ is called the 
maximum-likelihood estimator of $\theta$. (We are assuming throughout that the maximum 
of the likelihood function exists.) We shall now formalize the definition of a 
maximum-likelihood estimator. 

Definition 3 Maximum-likelihood estimator Let 

$$L(\theta) = L(\theta; x_1, \ldots, x_n)$$

be the likelihood function for the random variables $X_1, X_2, \ldots, X_n$. If 
$\hat{\theta}$ [where $\hat{\theta} = \hat{\vartheta}(x_1, x_2, \ldots, x_n)$ is a function of the observations $x_1, \ldots, x_n$] 
is the value of $\theta$ in $\bar{\Theta}$ which maximizes $L(\theta)$, then $\hat{\Theta} = \hat{\vartheta}(X_1, X_2, \ldots, X_n)$ 
is the maximum-likelihood estimator of $\theta$. $\hat{\theta} = \hat{\vartheta}(x_1, \ldots, x_n)$ is the 
maximum-likelihood estimate of $\theta$ for the sample $x_1, \ldots, x_n$. //// 


The most important cases which we shall consider are those in which 
$X_1, X_2, \ldots, X_n$ is a random sample from some density $f(x; \theta)$, so that the 
likelihood function is 

$$L(\theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta).$$

Many likelihood functions satisfy regularity conditions; so the 
maximum-likelihood estimator is the solution of the equation 

$$\frac{dL(\theta)}{d\theta} = 0.$$

Also $L(\theta)$ and $\log L(\theta)$ have their maxima at the same value of $\theta$, and it is 
sometimes easier to find the maximum of the logarithm of the likelihood. 

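If the likelihood function contains $k$ parameters, that is, if 

$$L(\theta_1, \theta_2, \ldots, \theta_k) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k),$$

then the maximum-likelihood estimators of the parameters $\theta_1, \theta_2, \ldots, \theta_k$ are 
the random variables $\hat{\Theta}_1 = \hat{\vartheta}_1(X_1, \ldots, X_n)$, $\hat{\Theta}_2 = \hat{\vartheta}_2(X_1, \ldots, X_n)$, ..., 
$\hat{\Theta}_k = \hat{\vartheta}_k(X_1, \ldots, X_n)$, where $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$ are the values in $\bar{\Theta}$ which maximize 
$L(\theta_1, \theta_2, \ldots, \theta_k)$. 

If certain regularity conditions are satisfied, the point where the likelihood 
is a maximum is a solution of the $k$ equations 

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_1} = 0$$

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_2} = 0$$

$$\vdots$$

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_k} = 0.$$
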
In this case it may also be easier to work with the logarithm of the likelihood. 
We shall illustrate these definitions with some examples. 
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EXAMPLE 5 Suppose that a random sample of size n is drawn from the 
Bernoulli distribution 

fix; p ) = pY _ */ { 0 , i}(*)> 0 ^ p <, 1 and q = 1 - p. 

The sample values x ly x 2 , ■ ■ ■, x n will be a sequence of Os and 1 s, and the 
likelihood function is 


L(p) = n p Xi q l Xi = p Zxi q " Zxi , 

i— 1 

and if we let 

y = 'L x i> 

we obtain 


and 


♦log L(p) = y log p + (n - y) log q 


d log L{p) = y _ n-y 
dp p q 


remembering that q = 1 — p. On putting this last expression equal to 0 
and solving for p, we find the estimate 

P = l = l'L x i = *’ ( 3 ) 

n n 


which is intuitively what the estimate for this parameter should be. It is
also a method-of-moments estimate. For $n = 3$, let us sketch the likelihood
function. Note that the likelihood function depends on the $x_i$'s
only through $\sum x_i$; thus the likelihood function can be represented by the
following four curves:

$$L_0 = L(p; \textstyle\sum x_i = 0) = (1 - p)^3$$
$$L_1 = L(p; \textstyle\sum x_i = 1) = p(1 - p)^2$$
$$L_2 = L(p; \textstyle\sum x_i = 2) = p^2(1 - p)$$
$$L_3 = L(p; \textstyle\sum x_i = 3) = p^3,$$

which are sketched in Fig. 1. Note that the point where the maximum of
each of the curves takes place for $0 \le p \le 1$ is the same as that given in
Eq. (3) when $n = 3$. ////
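
The four curves can also be checked numerically. The following is a minimal
sketch and not part of the original text: the function names
`bernoulli_likelihood` and `grid_argmax` and the grid of 301 points are
illustrative assumptions. It evaluates $L(p)$ on a grid over $[0, 1]$ for each
possible value of $\sum x_i$ when $n = 3$ and reports where the maximum falls.

```python
from fractions import Fraction


def bernoulli_likelihood(p, y, n):
    """L(p) = p**y * (1 - p)**(n - y) for y ones among n observations."""
    return p**y * (1 - p) ** (n - y)


def grid_argmax(y, n, steps=300):
    """Return the grid point in [0, 1] at which the likelihood is largest."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return max(grid, key=lambda p: bernoulli_likelihood(p, y, n))


n = 3
maximizers = {y: grid_argmax(y, n) for y in range(n + 1)}
for y, p_hat in maximizers.items():
    print(f"sum x_i = {y}: maximum at p = {p_hat}, y/n = {Fraction(y, n)}")
```

Exact rational arithmetic is used so that the grid points 1/3 and 2/3 are
represented without rounding; with floating-point numbers the same comparison
would hold only approximately.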


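* Recall that $\log x$ means $\log_e x$.

EXAMPLE 6 A random sample of size $n$ from the normal distribution has
the density

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(1/2\sigma^2)(x_i - \mu)^2}
= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\right].$$

The logarithm of the likelihood function is

$$L^* = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum (x_i - \mu)^2,$$

where $\sigma > 0$ and $-\infty < \mu < \infty$.

To find the location of its maximum, we compute

$$\frac{\partial L^*}{\partial \mu} = \frac{1}{\sigma^2} \sum (x_i - \mu)$$

and

$$\frac{\partial L^*}{\partial \sigma^2} = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^4} \sum (x_i - \mu)^2,$$

and on putting these derivatives equal to 0 and solving the resulting
equations for $\mu$ and $\sigma^2$, we find the estimates

$$\hat\mu = \frac{1}{n} \sum x_i = \bar x \qquad (4)$$

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \bar x)^2, \qquad (5)$$

which turn out to be the sample moments corresponding to $\mu$ and $\sigma^2$.

////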
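
As a quick numerical check of Eqs. (4) and (5), the sketch below computes the
two estimates for a small data set and evaluates both partial derivatives of
$L^*$ at them. The function names and the sample values are hypothetical and
chosen only for illustration.

```python
def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) as given by Eqs. (4) and (5)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


def score(xs, mu, sigma2):
    """Partial derivatives of L* with respect to mu and sigma^2."""
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / sigma2
    d_sigma2 = -n / (2 * sigma2) + sum((x - mu) ** 2 for x in xs) / (2 * sigma2**2)
    return d_mu, d_sigma2


sample = [2.1, 3.4, 1.9, 4.0, 2.6]  # hypothetical observations
mu_hat, sigma2_hat = normal_mle(sample)
d_mu, d_sigma2 = score(sample, mu_hat, sigma2_hat)
print(mu_hat, sigma2_hat, d_mu, d_sigma2)
```

Both derivatives vanish at the estimates up to floating-point rounding.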
EXAMPLE 7 Let the random variable $X$ have a uniform density given by

$$f(x; \theta) = I_{[\theta - 1/2,\ \theta + 1/2]}(x),$$

where $-\infty < \theta < \infty$; that is, $\bar\Theta$ = real line. The likelihood function
for a sample of size $n$ is

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} I_{[\theta - 1/2,\ \theta + 1/2]}(x_i)
= I_{[y_n - 1/2,\ y_1 + 1/2]}(\theta), \qquad (6)$$

where $y_1$ is the smallest of the observations and $y_n$ is the largest. The
last equality in Eq. (6) follows since $\prod_{i=1}^{n} I_{[\theta - 1/2,\ \theta + 1/2]}(x_i)$ is unity if and only
if all $x_1, \ldots, x_n$ are in the interval $[\theta - \frac12, \theta + \frac12]$, which is true if and
only if $\theta - \frac12 \le y_1$ and $y_n \le \theta + \frac12$, which is true if and only if
$y_n - \frac12 \le \theta \le y_1 + \frac12$. We see that the likelihood function is either 1
(for $y_n - \frac12 \le \theta \le y_1 + \frac12$) or 0 (otherwise); hence any statistic with value
$\hat\theta$ satisfying $y_n - \frac12 \le \hat\theta \le y_1 + \frac12$ is a maximum-likelihood estimate.
Examples are $y_n - \frac12$, $y_1 + \frac12$, and $\frac12(y_1 + y_n)$. This latter is the midpoint
between $y_n - \frac12$ and $y_1 + \frac12$, or the midpoint between $y_1$ and $y_n$, the
smallest and largest observations. ////
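
The flat likelihood of this example is easy to see in code. The sketch below is
illustrative only: the function names and the three observations are
hypothetical, and the observations are chosen to be exactly representable in
binary so that the endpoint comparisons are not affected by rounding.

```python
def uniform_likelihood(theta, xs):
    """Likelihood for the uniform density on [theta - 1/2, theta + 1/2]."""
    return 1 if all(theta - 0.5 <= x <= theta + 0.5 for x in xs) else 0


def mle_interval(xs):
    """Every theta in [y_n - 1/2, y_1 + 1/2] maximizes the likelihood."""
    return max(xs) - 0.5, min(xs) + 0.5


sample = [3.75, 4.0, 4.25]  # hypothetical observations
low, high = mle_interval(sample)
midrange = (min(sample) + max(sample)) / 2
for theta in (low, midrange, high, low - 0.01, high + 0.01):
    print(theta, uniform_likelihood(theta, sample))
```

The first three candidate values give likelihood 1 and are all
maximum-likelihood estimates; the last two fall just outside the interval and
give likelihood 0.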


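EXAMPLE 8 Let the random variable $X$ have a uniform distribution with
density given by

$$f(x; \theta) = f(x; \mu, \sigma) = \frac{1}{2\sqrt{3}\,\sigma} I_{[\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma]}(x),$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. (Recall Example 4.) Here the likelihood
function for a sample of size $n$ is

$$L(\mu, \sigma; x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{2\sqrt{3}\,\sigma} I_{[\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma]}(x_i)$$

$$= (2\sqrt{3}\,\sigma)^{-n} I_{[\mu - \sqrt{3}\sigma,\ y_n]}(y_1)\, I_{[y_1,\ \mu + \sqrt{3}\sigma]}(y_n)$$

$$= (2\sqrt{3}\,\sigma)^{-n} I_{[(\mu - y_1)/\sqrt{3},\ \infty)}(\sigma)\, I_{[(y_n - \mu)/\sqrt{3},\ \infty)}(\sigma)\, I_{[y_1,\ \infty)}(y_n),$$

where $y_1$ is the smallest of the observations and $y_n$ is the largest. The
likelihood function is $(2\sqrt{3}\,\sigma)^{-n}$ in the shaded area of Fig. 2 and 0 elsewhere.
$(2\sqrt{3}\,\sigma)^{-n}$ within the shaded area is clearly a maximum when $\sigma$
is smallest, which is at the intersection of the lines $\mu - \sqrt{3}\sigma = y_1$ and
$\mu + \sqrt{3}\sigma = y_n$.

FIGURE 2

Hence the maximum-likelihood estimates of $\mu$ and $\sigma$ are

$$\hat\mu = \frac12 (y_1 + y_n) \qquad (7)$$

and

$$\hat\sigma = \frac{1}{2\sqrt{3}} (y_n - y_1), \qquad (8)$$

which are quite different from the method-of-moments estimates given in
Example 4. ////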
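
A small sketch of Eqs. (7) and (8) follows, with hypothetical data and
function names. It also evaluates the likelihood slightly inside and slightly
outside the boundary $\sigma = \hat\sigma$. The likelihood is deliberately not
evaluated exactly at $\hat\sigma$, since floating-point rounding could push
an observation just outside the support there.

```python
import math


def uniform_mle(xs):
    """Maximum-likelihood estimates (mu_hat, sigma_hat) from Eqs. (7) and (8)."""
    y1, yn = min(xs), max(xs)
    return (y1 + yn) / 2, (yn - y1) / (2 * math.sqrt(3))


def likelihood(mu, sigma, xs):
    """(2*sqrt(3)*sigma)**(-n) if all observations lie in the support, else 0."""
    half_width = math.sqrt(3) * sigma
    if all(mu - half_width <= x <= mu + half_width for x in xs):
        return (2 * half_width) ** (-len(xs))
    return 0.0


sample = [1.0, 2.0, 4.0, 7.0]  # hypothetical observations
mu_hat, sigma_hat = uniform_mle(sample)
for factor in (0.99, 1.01, 1.5):
    print(factor, likelihood(mu_hat, factor * sigma_hat, sample))
```

Shrinking $\sigma$ below $\hat\sigma$ drives the likelihood to 0, and enlarging
it makes the likelihood smaller, which is the behavior described in the
example.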


The above four examples are sufficient to illustrate the application of the 
method of maximum likelihood. The last two show that one must not always 
rely on the differentiation process to locate the maximum. 

The function $L(\theta)$ may, for example, be represented by the curve in Fig. 3,
where the actual maximum is at $\hat\theta$, but the derivative set equal to 0 would locate
$\theta'$ as the maximum. One must also remember that the equation $\partial L/\partial\theta = 0$
locates minima as well as maxima, and hence one must avoid using a root of
the equation which actually locates a minimum.

We shall see in later sections (especially Sec. 9 of this chapter) that the 
maximum-likelihood estimator has some desirable optimum properties other 
than the intuitively appealing property that it maximizes the likelihood function. 
In addition, the maximum-likelihood estimators possess a property which is 
sometimes called the invariance property of maximum-likelihood estimators. A 
little reflection on the meaning of a single-valued inverse will convince one of 
the validity of the following theorem.

FIGURE 3

Theorem 1 Invariance property of maximum-likelihood estimators Let
$\hat\Theta = \hat\vartheta(X_1, X_2, \ldots, X_n)$ be the maximum-likelihood estimator of $\theta$ in the
density $f(x; \theta)$, where $\theta$ is assumed unidimensional. If $\tau(\cdot)$ is a function
with a single-valued inverse, then the maximum-likelihood estimator of
$\tau(\theta)$ is $\tau(\hat\Theta)$. ////

For example, in the normal density with $\mu_0$ known the maximum-likelihood
estimator of $\sigma^2$ is

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2.$$

By the invariance property of maximum-likelihood estimators, the maximum-likelihood
estimator of $\sigma$ is

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2}.$$

Similarly, the maximum-likelihood estimator of, say, $\log \sigma^2$ is

$$\log\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2\right].$$

The invariance property of maximum-likelihood estimators that is exhibited
in Theorem 1 above can and should be extended. Following Zehna [43],
we extend in two directions: First, $\theta$ will be taken as $k$-dimensional rather
than unidimensional, and, second, the assumption that $\tau(\cdot)$ has a single-valued
inverse will be removed. It can be noted that such extension is necessary by
considering two simple examples. As a first example, suppose an estimate of
the variance, namely $\theta(1 - \theta)$, of a Bernoulli distribution is desired. Example
5 gives the maximum-likelihood estimate of $\theta$ to be $\bar x$, but since $\theta(1 - \theta)$ is not
a one-to-one function of $\theta$, Theorem 1 does not give the maximum-likelihood
estimator of $\theta(1 - \theta)$. Theorem 2 below will give such an estimate, and it will
be $\bar x(1 - \bar x)$. As a second example, consider sampling from a normal distribution
where both $\mu$ and $\sigma^2$ are unknown, and suppose an estimate of $\mathcal{E}[X^2] =
\mu^2 + \sigma^2$ is desired. Example 6 gives the maximum-likelihood estimates of $\mu$
and $\sigma^2$, but $\mu^2 + \sigma^2$ is not a one-to-one function of $\mu$ and $\sigma^2$, and so the
maximum-likelihood estimate of $\mu^2 + \sigma^2$ is not known. Such an estimate will
be obtainable from Theorem 2 below. It will be $\bar x^2 + (1/n) \sum (x_i - \bar x)^2$.

Let $\theta = (\theta_1, \ldots, \theta_k)$ be a $k$-dimensional parameter, and, as before, let $\bar\Theta$
denote the parameter space. Suppose that the maximum-likelihood estimate
of $\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$, where $1 \le r \le k$, is desired. Let $T$ denote the range
space of the transformation $\tau(\cdot) = (\tau_1(\cdot), \ldots, \tau_r(\cdot))$. $T$ is an $r$-dimensional
space. Define

$$M(\tau; x_1, \ldots, x_n) = \sup_{\{\theta:\ \tau(\theta) = \tau\}} L(\theta; x_1, \ldots, x_n).$$

$M(\cdot\,; x_1, \ldots, x_n)$ is called the likelihood function induced by $\tau(\cdot)$.* When
estimating $\theta$ we maximized the likelihood function $L(\theta; x_1, \ldots, x_n)$ as a function of $\theta$ for fixed
$x_1, \ldots, x_n$; when estimating $\tau = \tau(\theta)$ we will maximize the likelihood function
induced by $\tau(\cdot)$, namely $M(\tau; x_1, \ldots, x_n)$, as a function of $\tau$ for fixed $x_1, \ldots, x_n$.
Thus, the maximum-likelihood estimate of $\tau = \tau(\theta)$, denoted by $\hat\tau$, is any
value that maximizes the induced likelihood function for fixed $x_1, \ldots, x_n$; that
is, $\hat\tau$ is such that $M(\hat\tau; x_1, \ldots, x_n) \ge M(\tau; x_1, \ldots, x_n)$ for all $\tau \in T$. The invariance
property of maximum-likelihood estimation is given in the following
theorem.


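Theorem 2 Let $\hat\Theta = (\hat\Theta_1, \ldots, \hat\Theta_k)$, where $\hat\Theta_j = \hat\vartheta_j(X_1, \ldots, X_n)$, be a
maximum-likelihood estimator of $\theta = (\theta_1, \ldots, \theta_k)$ in the density
$f(\cdot\,; \theta_1, \ldots, \theta_k)$. If $\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$ for $1 \le r \le k$ is a transformation
of the parameter space $\bar\Theta$, then a maximum-likelihood estimator of
$\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$ is $\tau(\hat\Theta) = (\tau_1(\hat\Theta), \ldots, \tau_r(\hat\Theta))$. [Note that $\tau_j(\theta) =
\tau_j(\theta_1, \ldots, \theta_k)$; so the maximum-likelihood estimator of $\tau_j(\theta_1, \ldots, \theta_k)$ is
$\tau_j(\hat\Theta_1, \ldots, \hat\Theta_k)$, $j = 1, \ldots, r$.]

PROOF Let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be a maximum-likelihood estimate of
$\theta = (\theta_1, \ldots, \theta_k)$. It suffices to show that $M(\tau(\hat\theta); x_1, \ldots, x_n) \ge
M(\tau; x_1, \ldots, x_n)$ for any $\tau \in T$, which follows immediately from the inequality

$$M(\tau; x_1, \ldots, x_n) = \sup_{\{\theta:\ \tau(\theta) = \tau\}} L(\theta; x_1, \ldots, x_n) \le \sup_{\theta \in \bar\Theta} L(\theta; x_1, \ldots, x_n)$$

$$= L(\hat\theta; x_1, \ldots, x_n) = \sup_{\{\theta:\ \tau(\theta) = \tau(\hat\theta)\}} L(\theta; x_1, \ldots, x_n) = M(\tau(\hat\theta); x_1, \ldots, x_n). \quad ////$$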


* The notation “sup” is used here, and elsewhere in this book, as it is usually used in
mathematics. For those readers who are not acquainted with this notation, not much
is lost if “sup” is replaced by “max,” where “max” is an abbreviation for maximum.

It is precisely this property of invariance enjoyed by maximum-likelihood
estimators that allowed us in our discussion of maximum-likelihood estimation
to consider estimating $(\theta_1, \ldots, \theta_k)$ rather than the more general $\tau_1(\theta_1, \ldots, \theta_k),
\ldots, \tau_r(\theta_1, \ldots, \theta_k)$.
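
The induced likelihood can be made concrete with the Bernoulli variance
$\tau(\theta) = \theta(1 - \theta)$, which is not one-to-one. The sketch below is an
illustration under stated assumptions: the function names, the sample (one
success in five trials), and the grid of candidate $\tau$ values are all
hypothetical. For each $\tau$ in $[0, \frac14]$ the set $\{\theta: \theta(1 - \theta) = \tau\}$ has at
most two points, so the supremum defining $M(\tau)$ is a maximum over two
likelihood values.

```python
import math


def likelihood(p, y, n):
    """Bernoulli likelihood p**y * (1 - p)**(n - y)."""
    return p**y * (1 - p) ** (n - y)


def induced_likelihood(tau, y, n):
    """M(tau): the largest L(p) over all p with p * (1 - p) = tau."""
    root = math.sqrt(max(0.0, 1 - 4 * tau))
    preimage = [(1 - root) / 2, (1 + root) / 2]
    return max(likelihood(p, y, n) for p in preimage)


n, y = 5, 1  # hypothetical sample: one success in five trials
x_bar = y / n
taus = [0.25 * k / 1000 for k in range(1001)]  # grid over the range [0, 1/4]
tau_hat = max(taus, key=lambda tau: induced_likelihood(tau, y, n))
print(tau_hat, x_bar * (1 - x_bar))
```

On this grid the maximizer of the induced likelihood agrees with
$\bar x(1 - \bar x)$, as Theorem 2 asserts.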


EXAMPLE 9 In the normal density, let $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$. Suppose
$\tau(\theta) = \mu + z_q \sigma$, where $z_q$ is given by $\Phi(z_q) = q$. $\tau(\theta)$ is the $q$th quantile.
According to Theorem 2, the maximum-likelihood estimator of $\tau(\theta)$ is
$\bar X + z_q \sqrt{(1/n) \sum (X_i - \bar X)^2}$. ////
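
The estimator of Example 9 can be computed as follows. This is a sketch only:
the function name `mle_quantile` and the data are hypothetical, and $z_q$ is
obtained from the standard normal inverse c.d.f. in Python's `statistics`
module.

```python
import math
from statistics import NormalDist


def mle_quantile(xs, q):
    """Plug the MLEs of mu and sigma into mu + z_q * sigma."""
    n = len(xs)
    x_bar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)  # divisor n, not n - 1
    z_q = NormalDist().inv_cdf(q)  # solves Phi(z_q) = q
    return x_bar + z_q * sigma_hat


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
median_hat = mle_quantile(sample, 0.5)
upper_hat = mle_quantile(sample, 0.975)
print(median_hat, upper_hat)
```

For these data $\bar x = 5$ and $\hat\sigma = 2$, so the estimated median is 5 and the
estimated 0.975 quantile is about $5 + 1.96 \times 2$.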

2.3 Other Methods 

There are several other methods of obtaining point estimators of parameters.
Among these are (i) the method of least squares, to be discussed in
Chap. X, (ii) the Bayes method, to be discussed later in this chapter, (iii) the
minimum-chi-square method, and (iv) the minimum-distance method. In this
subsection we will briefly consider the last two. Neither will be used again in
this book.

Minimum-chi-square method Let $X_1, \ldots, X_n$ be a random sample from a
density given by $f_X(x; \theta)$, and let $\mathcal{S}_1, \ldots, \mathcal{S}_k$ be a partition of the range of $X$.
The probability that an observation falls in cell $\mathcal{S}_j$, $j = 1, \ldots, k$, denoted by
$p_j(\theta)$, can be found. For instance, if $f_X(x; \theta)$ is the density function of a continuous
random variable, then $p_j(\theta) = P[X \text{ falls in cell } \mathcal{S}_j] = \int_{\mathcal{S}_j} f_X(x; \theta)\, dx$.
Note that $\sum_{j=1}^{k} p_j(\theta) = 1$. Let the random variable $N_j$ denote the number of
$X_i$'s in the sample which fall in cell $\mathcal{S}_j$, $j = 1, \ldots, k$; then $\sum_{j=1}^{k} N_j = n$, the
sample size. Form the following summation:

$$\chi^2 = \sum_{j=1}^{k} \frac{[n_j - n p_j(\theta)]^2}{n p_j(\theta)},$$

where $n_j$ is a value of $N_j$. The numerator of the $j$th term in the sum is the square
of the difference between the observed and the expected number of observations
falling in cell $\mathcal{S}_j$. The minimum-chi-square estimate of $\theta$ is that $\hat\theta$ which
minimizes $\chi^2$. It is that $\theta$ among all possible $\theta$'s which makes the expected
number of observations in cell $\mathcal{S}_j$ “nearest” the observed number. The minimum-chi-square
estimator depends on the partition $\mathcal{S}_1, \ldots, \mathcal{S}_k$ selected.

EXAMPLE 10 Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli distribution;
that is, $f_X(x; \theta) = \theta^x (1 - \theta)^{1-x}$ for $x = 0, 1$. Take $N_j$ = the
number of observations equal to $j$ for $j = 0, 1$. Here the range of the
observation $X$ is partitioned into the two sets consisting of the numbers 0
and 1 respectively.

$$\chi^2 = \sum_{j=0}^{1} \frac{[n_j - n p_j(\theta)]^2}{n p_j(\theta)} = \frac{[n_0 - n(1 - \theta)]^2}{n(1 - \theta)} + \frac{(n_1 - n\theta)^2}{n\theta}$$

$$= \frac{[n - n_1 - n(1 - \theta)]^2}{n(1 - \theta)} + \frac{(n_1 - n\theta)^2}{n\theta} = \frac{(n_1 - n\theta)^2}{n} \cdot \frac{1}{\theta(1 - \theta)}.$$

The minimum of $\chi^2$ as a function of $\theta$ can be found by inspection by
noting that $\chi^2 = 0$ for $\theta = n_1/n$. Hence $\hat\theta = n_1/n$. For this example
there was only one choice for the partition $\mathcal{S}_1, \ldots, \mathcal{S}_k$. The estimator
found is the same as what would be obtained by either the method of
moments or maximum likelihood. ////
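
The algebra of Example 10 can be verified by direct computation. In the sketch
below the counts $n_0 = 13$, $n_1 = 7$ and the grid of candidate $\theta$ values are
hypothetical; exact rational arithmetic is used so that the two forms of
$\chi^2$ can be compared for equality.

```python
from fractions import Fraction


def chi_square(theta, n0, n1):
    """Sum over the two cells of (observed - expected)**2 / expected."""
    n = n0 + n1
    observed = {0: n0, 1: n1}
    expected = {0: n * (1 - theta), 1: n * theta}
    return sum((observed[j] - expected[j]) ** 2 / expected[j] for j in (0, 1))


def chi_square_closed_form(theta, n0, n1):
    """The simplified expression (n1 - n*theta)**2 / (n * theta * (1 - theta))."""
    n = n0 + n1
    return (n1 - n * theta) ** 2 / (n * theta * (1 - theta))


n0, n1 = 13, 7  # hypothetical cell counts
grid = [Fraction(k, 100) for k in range(1, 100)]  # interior points of (0, 1)
theta_hat = min(grid, key=lambda theta: chi_square(theta, n0, n1))
print(theta_hat, chi_square(theta_hat, n0, n1))
```

The grid minimizer is $n_1/n = 7/20$, where $\chi^2 = 0$.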


Often it is difficult to locate that $\hat\theta$ which minimizes $\chi^2$; hence, the denominator
$n p_j(\theta)$ is sometimes changed to $n_j$ (if $n_j = 0$, unity is used) forming a
modified $\chi^2 = \sum_{j=1}^{k} \{[n_j - n p_j(\theta)]^2 / n_j\}$. The modified minimum-chi-square estimate
of $\theta$ is then that $\hat\theta$ which minimizes the modified $\chi^2$.

Minimum-distance method Let $X_1, \ldots, X_n$ be a random sample from the
distribution given by the cumulative distribution function $F_X(x; \theta) = F(x; \theta)$,
and let $d(F, G)$ be a distance function that measures how “far apart” two cumulative
distribution functions $F$ and $G$ are. An example of a distance function is
$d(F, G) = \sup_x |F(x) - G(x)|$, which is the largest vertical distance between $F$
and $G$. See Fig. 4.

The minimum-distance estimate of $\theta$ is that $\hat\theta$ among all possible $\theta$ for
which $d(F(x; \theta), F_n(x))$ is minimized, where $F_n(x)$ is the sample cumulative
distribution function. Thus, $\hat\theta$ is chosen so that $F(x; \hat\theta)$ will be “closest” to
$F_n(x)$, which is desirable since we saw in Subsec. 5.4 of Chap. VI that for a
fixed argument $x$ the sample cumulative distribution function has the same
distribution as the mean of a binomial distribution; hence, by the law of large
numbers $F_n(x)$ “converges” to $F(x)$. The minimum-distance estimator might be
intuitively appealing, but it is almost always difficult to find since locating
$\hat\theta$ which minimizes $d(F(x; \theta), F_n(x))$ is seldom easy. The following example is
an exception.

EXAMPLE 11 Again let $X_1, \ldots, X_n$ be a random sample from a Bernoulli
distribution; then

$$F(x; \theta) = (1 - \theta) I_{[0,1)}(x) + I_{[1,\infty)}(x),$$

where $0 \le \theta \le 1$. Let $n_j$ = the number of observations equal to $j$; $j = 0, 1$.
Then

$$F_n(x) = \frac{n_0}{n} I_{[0,1)}(x) + I_{[1,\infty)}(x).$$

Now if the distance function $d(F, G) = \sup_x |F(x) - G(x)|$ is used, then
$d(F(x; \theta), F_n(x))$ is minimized if $1 - \theta$ is taken equal to $n_0/n$ or $\theta = n_1/n =
\sum x_i / n$. Hence $\hat\theta = \bar x$. ////
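
Example 11 can be reproduced numerically. The sketch below is illustrative:
the function names, the sample of eight 0s and 1s, and the grid of candidate
$\theta$ values are hypothetical. Because both $F(x; \theta)$ and $F_n(x)$ are constant on
$(-\infty, 0)$, $[0, 1)$, and $[1, \infty)$, the supremum over $x$ reduces to a maximum over one
representative point from each of these three intervals.

```python
from fractions import Fraction


def model_cdf(x, theta):
    """Bernoulli c.d.f. F(x; theta)."""
    if x < 0:
        return Fraction(0)
    if x < 1:
        return 1 - theta
    return Fraction(1)


def sample_cdf(x, xs):
    """Sample c.d.f. F_n(x): the fraction of observations not exceeding x."""
    return Fraction(sum(1 for value in xs if value <= x), len(xs))


def sup_distance(theta, xs):
    """sup over x of |F(x; theta) - F_n(x)|, using one point per constant piece."""
    return max(abs(model_cdf(x, theta) - sample_cdf(x, xs)) for x in (-1, 0, 1))


sample = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical observations
grid = [Fraction(k, 8) for k in range(9)]
theta_hat = min(grid, key=lambda theta: sup_distance(theta, sample))
print(theta_hat, sup_distance(theta_hat, sample))
```

The minimizing value is $\bar x = 5/8$, at which the distance is 0.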


For a more thorough discussion of the minimum-chi-square method, see
Cramer [11] or Rao [17]. The minimum-distance method is discussed in
Wolfowitz [42].


3 PROPERTIES OF POINT ESTIMATORS 

We presented several methods of obtaining point estimators in the preceding 
section. All the methods were arrived at on a more or less intuitive basis. The 
question that now arises is: Are some of many possible estimators better, in 
some sense, than others? In this section we will define certain properties, which 
an estimator may or may not possess, that will help us in deciding whether one 
estimator is better than another. 

3.1 Closeness 

If we have a random sample $X_1, \ldots, X_n$ from a density, say $f(x; \theta)$, which is
known except for $\theta$, then a point estimator of $\tau(\theta)$ is a statistic, say $t(X_1, \ldots, X_n)$,
whose value is used as an estimate of $\tau(\theta)$. We will assume here that $\tau(\theta)$ is a
real-valued (not a vector) function of the unknown parameter $\theta$. [Often $\tau(\theta)$
will be $\theta$ itself.] Ideally, we would like the value of $t(X_1, \ldots, X_n)$ to be the
unknown $\tau(\theta)$, but this is not possible except in trivial cases, one of which follows.


EXAMPLE 12 Assume that one can sample from a density given by

$$f(x; \theta) = I_{(\theta - 1/2,\ \theta + 1/2)}(x),$$

where $\theta$ is known to be an integer. That is, $\bar\Theta$, the parameter space,
consists of all integers. Consider estimating $\theta$ on the basis of a single
observation $x_1$. If $t(x_1)$ is assigned as its value the integer nearest $x_1$, then
the statistic or estimator $t(X_1)$ will always correctly estimate $\theta$. In a
sense, the problem posed in this example is really not statistical since one
knows the value of $\theta$ after taking one observation. ////


Not being able to achieve the ultimate of always correctly estimating the
unknown $\tau(\theta)$, we look for an estimator $t(X_1, \ldots, X_n)$ that is “close” to $\tau(\theta)$.
There are several ways of defining “close.” $T = t(X_1, \ldots, X_n)$ is a statistic
and hence has a distribution, or rather a family of distributions, depending on
what $\theta$ is. The distribution of $T$ tells how the values $t$ of $T$ are distributed, and
we would like to have the values of $T$ distributed near $\tau(\theta)$; that is, we would like
to select $t(\cdot, \ldots, \cdot)$ so that the values of $T = t(X_1, \ldots, X_n)$ are concentrated
near $\tau(\theta)$. We saw that the mean and variance of a distribution were, respectively,
measures of location and spread. So what we might require of an
estimator is that it have its mean near or equal to $\tau(\theta)$ and have small variance.
These two notions are explored in Subsec. 3.2 below and then again in Sec. 5.

Rather than resorting to characteristics of a distribution, such as its mean 
and variance, one can define what “concentration” might mean in terms of the 
distribution itself. Two such definitions follow. 

Definition 4 More concentrated and most concentrated Let $T =
t(X_1, \ldots, X_n)$ and $T' = t'(X_1, \ldots, X_n)$ be two estimators of $\tau(\theta)$. $T'$ is
called a more concentrated estimator of $\tau(\theta)$ than $T$ if and only if
$P_\theta[\tau(\theta) - \lambda < T' \le \tau(\theta) + \lambda] \ge P_\theta[\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda]$ for all $\lambda > 0$
and for each $\theta$ in $\bar\Theta$. An estimator $T^* = t^*(X_1, \ldots, X_n)$ is called most
concentrated if it is more concentrated than any other estimator. ////

Remark The subscript $\theta$ on the probability symbol $P_\theta[\ \cdot\ ]$ is there to
emphasize that, in general, such probability depends on $\theta$. For instance,
in $P_\theta[\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda]$, the event $\{\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda\}$ is
described in terms of the random variable $T$, and, in general, the distribution
of $T$ is indexed by $\theta$. ////

We see from the definition that the property of most concentrated is
highly desirable (Pitman [41], in defense of his calling a most concentrated
estimator best, stated that such an estimator is “undeniably best”); unfortunately,
most concentrated estimators seldom exist. There are just too many
possible estimators for any one of them to be most concentrated. What is then
sometimes done is to restrict the totality of possible estimators under consideration
by requiring that each estimator possess some other desirable property
and to look for a best or most concentrated estimator in this restricted class.
We will not pursue the problem of finding most concentrated estimators, even
within some restricted class, in this book.

Another criterion for comparing estimators is the following one. 

Definition 5 Pitman-closer and Pitman-closest Let $T = t(X_1, \ldots, X_n)$
and $T' = t'(X_1, \ldots, X_n)$ be two estimators of $\tau(\theta)$. $T'$ is called a Pitman-closer
estimator of $\tau(\theta)$ than $T$ if and only if

$$P_\theta[\,|T' - \tau(\theta)| < |T - \tau(\theta)|\,] \ge \tfrac12 \quad \text{for each } \theta \text{ in } \bar\Theta.$$

An estimator $T^*$ is called Pitman-closest if it is Pitman-closer than any
other estimator. ////

The property of Pitman-closest is, like the property of most concentrated,
desirable, yet rarely will there exist a Pitman-closest estimator. Both Pitman-closer
and more concentrated are intuitively attractive properties to be used to compare
estimators, yet they are not always useful. Given two estimators $T$ and
$T'$, one does not have to be more concentrated or Pitman-closer than the other.
What often happens is that one, say $T$, is Pitman-closer or more concentrated
for some $\theta$ in $\bar\Theta$, and the other $T'$ is Pitman-closer or more concentrated for
other $\theta$ in $\bar\Theta$; and since $\theta$ is unknown, we cannot say which estimator is preferred.
Since Pitman-closest estimators rarely exist for applied problems, we will not
devote further study to the notion in this book; instead, we will consider other
ways of measuring the closeness of an estimator to $\tau(\theta)$.

Competing estimators can be compared by defining a measure of the closeness
of an estimate to the unknown $\tau(\theta)$. An estimator $T' = t'(X_1, \ldots, X_n)$
of $\tau(\theta)$ will be judged better than an estimator $T = t(X_1, \ldots, X_n)$ if the measure
of the closeness of $T'$ to $\tau(\theta)$ indicates that $T'$ is closer to $\tau(\theta)$ than $T$. Such
concepts of closeness will be discussed in Subsecs. 3.2 and 3.4.

In the above we were assuming that $n$, the sample size, was fixed. Still
another meaning can be affixed to “closeness” if one thinks in terms of increasing
sample size. It seems that a good estimator should do better when it is based
on a large sample than when it is based on a small sample. Consistency and
asymptotic efficiency are two properties that are defined in terms of increasing
sample size; they are considered in Subsec. 3.3. Properties of point estimators
that are defined for a fixed sample size are sometimes referred to as small-sample
properties, whereas properties that are defined for increasing sample size are
sometimes referred to as large-sample properties.

3.2 Mean-squared Error 

A useful, though perhaps crude, measure of goodness or closeness of an estimator
$t(X_1, \ldots, X_n)$ of $\tau(\theta)$ is what is called the mean-squared error of the
estimator.

Definition 6 Mean-squared error Let $T = t(X_1, \ldots, X_n)$ be an
estimator of $\tau(\theta)$. $\mathcal{E}_\theta[[T - \tau(\theta)]^2]$ is defined to be the mean-squared error
of the estimator $T = t(X_1, \ldots, X_n)$. ////

Notation Let $\mathrm{MSE}_t(\theta)$ denote the mean-squared error of the estimator
$T = t(X_1, \ldots, X_n)$ of $\tau(\theta)$. ////

Remark The subscript $\theta$ on the expectation symbol $\mathcal{E}_\theta$ indicates from
which density in the family under consideration the sample came. That is,

$$\mathcal{E}_\theta[[T - \tau(\theta)]^2] = \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2]$$

$$= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)]^2 f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n,$$

where $f(x; \theta)$ is the probability density function from which the random
sample was selected. ////

The name “mean-squared error” can be justified if one first thinks of
the difference $t - \tau(\theta)$, where $t$ is a value of $T$ used to estimate $\tau(\theta)$, as the error
made in estimating $\tau(\theta)$, and then interprets the “mean” in “mean-squared
error” as expected or average. To support the contention that the mean-squared
error of an estimator is a measure of goodness, one merely notes that
$\mathcal{E}_\theta[[T - \tau(\theta)]^2]$ is a measure of the spread of $T$ values about $\tau(\theta)$, just as the
variance of a random variable is a measure of its spread about its mean. If we
were to compare estimators by looking at their respective mean-squared errors,
naturally we would prefer one with small or smallest mean-squared error. We
could define as best that estimator with smallest mean-squared error, but such
estimators rarely exist. In general, the mean-squared error of an estimator
depends on $\theta$.

For any two estimators $T_1 = t_1(X_1, \ldots, X_n)$ and $T_2 = t_2(X_1, \ldots, X_n)$ of
$\tau(\theta)$, their respective mean-squared errors $\mathrm{MSE}_{t_1}(\theta)$ and $\mathrm{MSE}_{t_2}(\theta)$ as functions
of $\theta$ are likely to cross; so for some $\theta$, $t_1$ has smaller MSE, and for others $t_2$ has
smaller MSE. We would then have no basis for preferring one of the estimators
over the other. See Fig. 5.

FIGURE 5

The following example shows that except in very rare cases an estimator
with smallest mean-squared error will not exist.


EXAMPLE 13 Let $X_1, \ldots, X_n$ be a random sample from the density $f(x; \theta)$,
where $\theta$ is a real number, and consider estimating $\theta$ itself; that is, $\tau(\theta) = \theta$.
We seek an estimator, say $T^* = t^*(X_1, \ldots, X_n)$, such that $\mathrm{MSE}_{t^*}(\theta) \le
\mathrm{MSE}_t(\theta)$ for every $\theta$ and for any other estimator $T = t(X_1, \ldots, X_n)$ of $\theta$.
Consider the family of estimators $T_{\theta_0} = t_{\theta_0}(X_1, \ldots, X_n) = \theta_0$ indexed by
$\theta_0$ for $\theta_0 \in \bar\Theta$. For each $\theta_0$ belonging to $\bar\Theta$, the estimator $T_{\theta_0}$ ignores the
observations and estimates $\theta$ to be $\theta_0$. Note that

$$\mathrm{MSE}_{t_{\theta_0}}(\theta) = \mathcal{E}_\theta[[t_{\theta_0}(X_1, \ldots, X_n) - \theta]^2]
= \mathcal{E}_\theta[(\theta_0 - \theta)^2] = (\theta_0 - \theta)^2;$$

so $\mathrm{MSE}_{t_{\theta_0}}(\theta_0) = 0$; that is, the mean-squared error of $t_{\theta_0}$ evaluated at
$\theta = \theta_0$ is 0. Hence, if there is to exist an estimator $T^* = t^*(X_1, \ldots, X_n)$
satisfying $\mathrm{MSE}_{t^*}(\theta) \le \mathrm{MSE}_t(\theta)$ for every $\theta$ and for any estimator
$t$, $\mathrm{MSE}_{t^*}(\theta) \equiv 0$. [For any $\theta_0$, $\mathrm{MSE}_{t^*}(\theta_0) = 0$ since $\mathrm{MSE}_{t^*}(\theta_0) \le
\mathrm{MSE}_{t_{\theta_0}}(\theta_0) = 0$.] In order for an estimator $t^*$ to have its mean-squared
error identically 0, it must always estimate $\theta$ correctly, which
means that from the sample you must be able to identify the true parameter
value. ////
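
The crossing of mean-squared-error curves can be seen by comparing the
constant estimator of Example 13 with the sample mean. The sketch below makes
assumptions that are not in the example itself: $\theta$ is taken to be the mean of
a density with variance $\sigma^2$, so that the sample mean is unbiased with
mean-squared error $\sigma^2/n$, and the numerical values of $\sigma^2$, $n$, and $\theta_0$ are
hypothetical.

```python
def mse_constant(theta0, theta):
    """MSE of the estimator that ignores the data and always reports theta0."""
    return (theta0 - theta) ** 2


def mse_sample_mean(sigma2, n):
    """MSE of the sample mean when it is unbiased for theta: sigma^2 / n."""
    return sigma2 / n


sigma2, n, theta0 = 1.0, 4, 0.0  # hypothetical values
comparison = {
    theta: (mse_constant(theta0, theta), mse_sample_mean(sigma2, n))
    for theta in (0.0, 0.25, 1.0, 2.0)
}
for theta, (constant_mse, mean_mse) in comparison.items():
    better = "constant" if constant_mse < mean_mse else "sample mean"
    print(f"theta = {theta}: {constant_mse} vs {mean_mse} -> {better}")
```

Near $\theta_0$ the constant estimator has the smaller mean-squared error, and far
from $\theta_0$ the sample mean does, so neither is uniformly better.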
One reason for being unable to find an estimator with uniformly smallest
mean-squared error is that the class of all possible estimators is too large:
it includes some estimators that are extremely prejudiced in favor of particular $\theta$.
For instance, in the example above $t_{\theta_0}(X_1, \ldots, X_n)$ is highly partial to $\theta_0$ since
it always estimates $\theta$ to be $\theta_0$. One could restrict the totality of estimators by
considering only estimators that satisfy some other property. One such
property is that of unbiasedness.

Definition 7 Unbiased An estimator $T = t(X_1, \ldots, X_n)$ is defined to
be an unbiased estimator of $\tau(\theta)$ if and only if

$$\mathcal{E}_\theta[T] = \mathcal{E}_\theta[t(X_1, \ldots, X_n)] = \tau(\theta) \quad \text{for all } \theta \in \bar\Theta. \quad ////$$

An estimator is unbiased if the mean of its distribution equals $\tau(\theta)$, the
function of the parameter being estimated. Consider again the estimator
$t_{\theta_0}(X_1, \ldots, X_n) = \theta_0$ of the above example; $\mathcal{E}_\theta[t_{\theta_0}(X_1, \ldots, X_n)] = \mathcal{E}_\theta[\theta_0] =
\theta_0 \ne \theta$; so $t_{\theta_0}(X_1, \ldots, X_n)$ is not an unbiased estimator of $\theta$. If we restricted the
totality of estimators under consideration by considering only unbiased estimators,
we could hope to find an estimator with uniformly smallest mean-squared
error within the restricted class, that is, within the class of unbiased estimators.
The problem of finding an unbiased estimator with uniformly smallest mean-squared
error among all unbiased estimators is dealt with in Sec. 5 below.

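Remark

$$\mathrm{MSE}_t(\theta) = \mathrm{var}[T] + \{\tau(\theta) - \mathcal{E}_\theta[T]\}^2. \qquad (9)$$

So if $T$ is an unbiased estimator of $\tau(\theta)$, then $\mathrm{MSE}_t(\theta) = \mathrm{var}[T]$.

PROOF

$$\mathrm{MSE}_t(\theta) = \mathcal{E}_\theta[[T - \tau(\theta)]^2] = \mathcal{E}_\theta[(\{T - \mathcal{E}_\theta[T]\} - \{\tau(\theta) - \mathcal{E}_\theta[T]\})^2]$$

$$= \mathcal{E}_\theta[(T - \mathcal{E}_\theta[T])^2] - 2\{\tau(\theta) - \mathcal{E}_\theta[T]\}\,\mathcal{E}_\theta[T - \mathcal{E}_\theta[T]]
+ \mathcal{E}_\theta[\{\tau(\theta) - \mathcal{E}_\theta[T]\}^2]$$

$$= \mathrm{var}[T] + \{\tau(\theta) - \mathcal{E}_\theta[T]\}^2. \quad ////$$

The term $\tau(\theta) - \mathcal{E}_\theta[T]$ is called the bias of the estimator $T$ and can be
either positive, negative, or zero. The remark shows that the mean-squared
error is the sum of two nonnegative quantities; it also shows how the mean-squared
error, variance, and bias of an estimator are related.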
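
Equation (9) can be confirmed exactly for a discrete case by enumerating the
distribution of an estimator. In the sketch below, which is illustrative
only, $Y = \sum X_i$ is binomial for a Bernoulli sample; the estimator
$(Y + 1)/(n + 2)$ is a hypothetical biased estimator chosen for the comparison,
and the values $n = 6$, $p = 1/3$ are arbitrary.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(y, n, p):
    """P[Y = y] for Y binomial with parameters n and p."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def moments(estimator, n, p):
    """Return (mean, variance, MSE) of estimator(Y, n) as an estimator of p."""
    values = [(estimator(y, n), binomial_pmf(y, n, p)) for y in range(n + 1)]
    mean = sum(t * w for t, w in values)
    variance = sum((t - mean) ** 2 * w for t, w in values)
    mse = sum((t - p) ** 2 * w for t, w in values)
    return mean, variance, mse


def sample_mean(y, n):
    return Fraction(y, n)


def shrunk(y, n):
    """Hypothetical biased estimator (Y + 1) / (n + 2)."""
    return Fraction(y + 1, n + 2)


n, p = 6, Fraction(1, 3)
for name, estimator in (("sample mean", sample_mean), ("shrunk", shrunk)):
    mean, variance, mse = moments(estimator, n, p)
    print(name, mean, variance, mse, variance + (p - mean) ** 2)
```

For both estimators the last two printed values coincide, and for the
unbiased sample mean the mean-squared error equals the variance.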


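EXAMPLE 14 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\mu, \sigma^2}(x)$.
Recall that the maximum-likelihood estimators of $\mu$ and $\sigma^2$ are, respectively,
$\bar X$ and $(1/n) \sum (X_i - \bar X)^2$. (See Example 6.) Now $\mathcal{E}_\theta[\bar X] = \mu$; so
$\bar X$ is an unbiased estimator of $\mu$, and hence the mean-squared error of
$\bar X = \mathcal{E}_\theta[(\bar X - \mu)^2] = \mathrm{var}[\bar X] = \sigma^2/n$. We know that $\mathcal{E}_\theta[S^2] = \sigma^2$; so

$$\mathcal{E}_\theta[(1/n) \textstyle\sum (X_i - \bar X)^2] = [(n - 1)/n]\,\mathcal{E}_\theta[[1/(n - 1)] \textstyle\sum (X_i - \bar X)^2]
= [(n - 1)/n]\,\mathcal{E}_\theta[S^2] = [(n - 1)/n]\,\sigma^2.$$

Hence the maximum-likelihood estimator of $\sigma^2$ is not unbiased. The
mean-squared error of $(1/n) \sum (X_i - \bar X)^2$ is

$$\mathcal{E}_\theta[[(1/n) \textstyle\sum (X_i - \bar X)^2 - \sigma^2]^2]
= \mathrm{var}[(1/n) \textstyle\sum (X_i - \bar X)^2] + \{\sigma^2 - \mathcal{E}_\theta[(1/n) \textstyle\sum (X_i - \bar X)^2]\}^2$$

$$= \frac{(n - 1)^2}{n^2}\,\mathrm{var}[S^2] + \left(\sigma^2 - \frac{n - 1}{n}\,\sigma^2\right)^2$$

$$= \frac{(n - 1)^2}{n^2} \cdot \frac{1}{n}\left(\mu_4 - \frac{n - 3}{n - 1}\,\sigma^4\right) + \frac{\sigma^4}{n^2},$$

using Eq. (10) of Theorem 2 in Chap. VI. ////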
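
The final expression of Example 14 can be evaluated for the normal case. The
sketch below adds one fact not stated in the example, namely that the fourth
central moment of a normal density is $\mu_4 = 3\sigma^4$; the function names are
hypothetical. With this substitution the formula quoted from Chap. VI gives
$\mathrm{var}[S^2] = 2\sigma^4/(n - 1)$, and the mean-squared error of the maximum-likelihood
estimator simplifies to $(2n - 1)\sigma^4/n^2$.

```python
from fractions import Fraction


def var_s2_normal(n, sigma2):
    """var[S^2] = (1/n) * (mu_4 - (n - 3)/(n - 1) * sigma^4) with mu_4 = 3 * sigma^4."""
    mu4 = 3 * sigma2**2
    return Fraction(1, n) * (mu4 - Fraction(n - 3, n - 1) * sigma2**2)


def mse_mle_variance(n, sigma2):
    """MSE of (1/n) * sum (X_i - X_bar)^2 as variance plus squared bias."""
    shrink = Fraction(n - 1, n)
    variance = shrink**2 * var_s2_normal(n, sigma2)
    bias = sigma2 - shrink * sigma2
    return variance + bias**2


sigma2 = 1
for n in (2, 5, 10, 50):
    print(n, var_s2_normal(n, sigma2), mse_mle_variance(n, sigma2))
```

In this normal case the biased maximum-likelihood estimator has a smaller
mean-squared error than the unbiased $S^2$ for every $n \ge 2$, which illustrates
that unbiasedness and small mean-squared error are different requirements.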


Remark For the most part, in the remainder of this book we will take 
the mean-squared error of an estimator as our standard in assessing the 
goodness of an estimator. //// 


3.3 Consistency and BAN 

In the previous subsection we defined the mean-squared error of an estimator 
and the property of unbiasedness. Both concepts were defined for a fixed 
sample size. In this subsection we will define two concepts that are defined 
for increasing sample size. In our notation for an estimator of $\tau(\theta)$, let us use
$T_n = t_n(X_1, \ldots, X_n)$, where the subscript $n$ of $t$ indicates sample size. Actually
we will be considering a sequence of estimators, say $T_1 = t_1(X_1)$, $T_2 = t_2(X_1, X_2)$,
$T_3 = t_3(X_1, X_2, X_3)$, ..., $T_n = t_n(X_1, \ldots, X_n)$, .... An obvious example is
$T_n = t_n(X_1, \ldots, X_n) = \bar X_n = (1/n) \sum_{i=1}^{n} X_i$. Ordinarily the functions $t_n$ in the
sequence will be the same kind of function for each $n$.

When considering a sequence of estimators, it seems that a good sequence 
of estimators should be one for which the values of the estimators tend to get 
closer to the quantity being estimated as the sample size increases. The following 
definitions formalize this intuitively desirable notion of limiting closeness. 

Definition 8 Mean-squared-error consistency Let $T_1, T_2, \ldots, T_n, \ldots$
be a sequence of estimators of $\tau(\theta)$, where $T_n = t_n(X_1, \ldots, X_n)$ is based
on a sample of size $n$. This sequence of estimators is defined to be a
mean-squared-error consistent sequence of estimators of $\tau(\theta)$ if and only if
$\lim_{n \to \infty} \mathcal{E}_\theta[[T_n - \tau(\theta)]^2] = 0$ for all $\theta$ in $\bar\Theta$. ////

Remark Mean-squared-error consistency implies that both the bias
and the variance of $T_n$ approach 0 since $\mathcal{E}_\theta[[T_n - \tau(\theta)]^2] = \mathrm{var}[T_n]
+ \{\tau(\theta) - \mathcal{E}_\theta[T_n]\}^2$. ////


EXAMPLE 15 In sampling from any density having mean $\mu$ and variance
$\sigma^2$, let $\bar X_n = (1/n) \sum_{i=1}^{n} X_i$ be a sequence of estimators of $\mu$ and $S_n^2 =
[1/(n - 1)] \sum_{i=1}^{n} (X_i - \bar X_n)^2$ be a sequence of estimators of $\sigma^2$. $\mathcal{E}[(\bar X_n - \mu)^2] =
\mathrm{var}[\bar X_n] = \sigma^2/n \to 0$ as $n \to \infty$; hence the sequence $\{\bar X_n\}$ is a mean-squared-error
consistent sequence of estimators of $\mu$.

$$\mathcal{E}[(S_n^2 - \sigma^2)^2] = \mathrm{var}[S_n^2] = \frac{1}{n}\left(\mu_4 - \frac{n - 3}{n - 1}\,\sigma^4\right) \to 0$$

as $n \to \infty$, using Eq. (10) of Chap. VI; hence the sequence $\{S_n^2\}$ is a
mean-squared-error consistent sequence of estimators of $\sigma^2$. Note that if
$T_n = (1/n) \sum (X_i - \bar X)^2$, then the sequence $\{T_n\}$ is also a mean-squared-error
consistent sequence of estimators of $\sigma^2$. ////


There is another weaker notion of consistency given in the following 
definition. 


Definition 9 Simple consistency Let $T_1, T_2, \ldots, T_n, \ldots$ be a sequence
of estimators of $\tau(\theta)$, where $T_n = t_n(X_1, \ldots, X_n)$. The sequence $\{T_n\}$ is
defined to be a simple (or weakly) consistent sequence of estimators of $\tau(\theta)$
if for every $\varepsilon > 0$ the following is satisfied:

$$\lim_{n \to \infty} P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = 1 \quad \text{for every } \theta \text{ in } \bar\Theta. \quad ////$$


Remark If an estimator is a mean-squared-error consistent estimator,
it is also a simple consistent estimator, but not necessarily vice versa.

PROOF

$$P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = P_\theta[\,|T_n - \tau(\theta)| < \varepsilon\,]$$

$$= P_\theta[[T_n - \tau(\theta)]^2 < \varepsilon^2] \ge 1 - \frac{\mathcal{E}_\theta[[T_n - \tau(\theta)]^2]}{\varepsilon^2}$$

by the Chebyshev inequality. As $n$ approaches infinity, $\mathcal{E}_\theta[[T_n - \tau(\theta)]^2]$
approaches 0; hence $\lim_{n \to \infty} P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = 1$. ////
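
The bound used in the proof can be compared with an exact probability in one
case where the latter is available. The sketch below assumes sampling from a
normal density, so that $\bar X_n$ is exactly normal with variance $\sigma^2/n$; the
function names and the values $\sigma = 1$, $\varepsilon = 0.1$ are hypothetical.

```python
import math
from statistics import NormalDist


def chebyshev_lower_bound(mse, eps):
    """Lower bound 1 - MSE / eps^2 on P[|T_n - tau| < eps]; may be negative."""
    return 1 - mse / eps**2


def exact_probability(n, sigma, eps):
    """P[|X_bar_n - mu| < eps] when the sample comes from a normal density."""
    z = eps * math.sqrt(n) / sigma
    return 2 * NormalDist().cdf(z) - 1


sigma, eps = 1.0, 0.1  # hypothetical values
for n in (1, 10, 100, 1000, 10000):
    bound = chebyshev_lower_bound(sigma**2 / n, eps)
    print(n, round(bound, 4), round(exact_probability(n, sigma, eps), 4))
```

The bound is uninformative for small $n$, where it is negative, but both
columns approach 1 as $n$ grows, which is all that the proof requires.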


We close this subsection with one further large-sample definition. 

Definition 10 Best asymptotically normal estimators (BAN estimators)
A sequence of estimators $T_1^*, \ldots, T_n^*, \ldots$ of $\tau(\theta)$ is defined to be best
asymptotically normal (BAN) if and only if the following four conditions
are satisfied:

(i) The distribution of $\sqrt{n}\,[T_n^* - \tau(\theta)]$ approaches the normal
distribution with mean 0 and variance $\sigma^{*2}(\theta)$ as $n$ approaches infinity.

(ii) For every $\varepsilon > 0$,

$$\lim_{n \to \infty} P_\theta[\,|T_n^* - \tau(\theta)| > \varepsilon\,] = 0 \quad \text{for each } \theta \text{ in } \bar\Theta.$$

(iii) Let $\{T_n\}$ be any other sequence of simple consistent estimators
for which the distribution of $\sqrt{n}\,[T_n - \tau(\theta)]$ approaches the normal distribution
with mean 0 and variance $\sigma^2(\theta)$.

(iv) $\sigma^2(\theta)$ is not less than $\sigma^{*2}(\theta)$ for all $\theta$ in any open interval. ////


Remark The abbreviation BAN is sometimes replaced by CANE,
standing for consistent asymptotically normal efficient. ////


The usefulness of this definition derives partially from theorems proving 
the existence of BAN estimators and from the fact that ordinarily reasonable 
estimators are asymptotically normally distributed. 

It can be shown that for samples drawn from a normal density with
mean $\mu$ and variance $\sigma^2$ the sequence $T_n^* = (1/n) \sum_{i=1}^{n} X_i = \bar X_n$ for $n = 1, 2, \ldots$
is a BAN estimator of $\mu$. In fact, the limiting distribution of $\sqrt{n}\,(\bar X_n - \mu)$ is
normal with mean 0 and variance $\sigma^2$, and no other estimator can have smaller
limiting variance in any interval of $\mu$ values. However, there are many other
estimators for this problem which are also BAN estimators of $\mu$, that is, estimators
with the same normal distribution in the limit. For example,


$$T_n' = \frac{1}{n + 1} \sum_{i=1}^{n} X_i, \qquad n = 1, 2, \ldots,$$


is a BAN estimator of $\mu$. BAN estimators are necessarily weakly consistent
by (ii) of the definition.

3.4 Loss and Risk Functions

In Subsec. 3.2 we used mean-squared error of an estimator as a measure of the
closeness of the estimator to $\tau(\theta)$. Other measures are possible, for example,

$$\mathcal{E}_\theta[\,|T - \tau(\theta)|\,],$$

called the mean absolute deviation. In order to exhibit and consider still other
measures of closeness, we will borrow and rely on the language of decision
theory. On the basis of an observed random sample from some density
function, the statistician has to decide what to estimate $\tau(\theta)$ to be. One might
then call the value of some estimator $T = t(X_1, \ldots, X_n)$ a decision and call the
estimator itself a decision function since it tells us what decision to make. Now
the estimate $t$ of $\tau(\theta)$ might be in error; if so, some measure of the severity of
the error seems appropriate. The word “loss” is used in place of “error,”
and “loss function” is used as a measure of the “error.” A formal definition
follows.

Definition 11 Loss function Consider estimating $\tau(\theta)$. Let $t$ denote
an estimate of $\tau(\theta)$. The loss function, denoted by $\ell(t; \theta)$, is defined to
be a real-valued function satisfying (i) $\ell(t; \theta) \ge 0$ for all possible estimates
$t$ and all $\theta$ in $\Theta$ and (ii) $\ell(t; \theta) = 0$ for $t = \tau(\theta)$. $\ell(t; \theta)$ equals the loss
incurred if one estimates $\tau(\theta)$ to be $t$ when $\theta$ is the true parameter value. ////

In a given estimation problem one would have to define an appropriate 
loss function for the particular problem under study. It is a measure of the
error and presumably would be greater for large error than for small error. We 
would want the loss to be small; or, stated another way, we want the error in 
estimation to be small, or we want the estimate to be close to what it is estimating. 


EXAMPLE 16 Several possible loss functions are:

(i) $\ell_1(t; \theta) = [t - \tau(\theta)]^2$.

(ii) $\ell_2(t; \theta) = |t - \tau(\theta)|$.

(iii) $\ell_3(t; \theta) = A$ if $|t - \tau(\theta)| > \varepsilon$ and $\ell_3(t; \theta) = 0$ if $|t - \tau(\theta)| \le \varepsilon$, where $A > 0$.

(iv) $\ell_4(t; \theta) = \rho(\theta)\,|t - \tau(\theta)|^r$ for $\rho(\theta) \ge 0$ and $r > 0$.

$\ell_1$ is called the squared-error loss function, and $\ell_2$ is called the
absolute-error loss function. Note that both $\ell_1$ and $\ell_2$ increase as the error $t - \tau(\theta)$
increases in magnitude. $\ell_3$ says that you lose nothing if the estimate $t$
is within $\varepsilon$ units of $\tau(\theta)$ and otherwise you lose amount $A$. $\ell_4$ is a general
loss function that includes both $\ell_1$ and $\ell_2$ as special cases. ////
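
The four loss functions of Example 16 can be written down directly. The following is a minimal Python sketch, not part of the original text; the function names, the default values of A, eps, rho, and r, and the numbers in the usage lines are assumptions chosen for illustration. Here `tau` stands for the true value of the quantity being estimated, and `rho` is the value of the weight function at the parameter value of interest.

```python
def squared_error_loss(t, tau):
    """Loss (i): squared distance between the estimate t and the target tau."""
    return (t - tau) ** 2


def absolute_error_loss(t, tau):
    """Loss (ii): absolute distance between the estimate t and the target tau."""
    return abs(t - tau)


def threshold_loss(t, tau, A=1.0, eps=0.1):
    """Loss (iii): lose A if t misses tau by more than eps, otherwise lose nothing."""
    return A if abs(t - tau) > eps else 0.0


def power_loss(t, tau, rho=1.0, r=2.0):
    """Loss (iv): rho * |t - tau|**r, with rho >= 0 and r > 0 (illustrative defaults)."""
    return rho * abs(t - tau) ** r


# Hypothetical numbers: the target is 0.5 and the estimate is 0.8.
tau, t = 0.5, 0.8
print(squared_error_loss(t, tau), absolute_error_loss(t, tau))
print(threshold_loss(t, tau, A=2.0, eps=0.1))
print(power_loss(t, tau, rho=1.0, r=2.0), power_loss(t, tau, rho=1.0, r=1.0))
```

With rho = 1, taking r = 2 reproduces the squared-error loss and r = 1 reproduces the absolute-error loss, which is the sense in which loss (iv) includes (i) and (ii) as special cases.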

We assume now that an appropriate loss function has been defined for our
estimation problem, and we think of the loss function as a measure of error
or loss. Our object is to select an estimator $T = t(X_1, \dots, X_n)$ that makes
this error or loss small. (Admittedly, we are not considering a very important,
substantive problem by assuming that a suitable loss function is given. In
general, selection of an appropriate loss function is not trivial.) The loss
function in its first argument depends on the estimate $t$, and $t$ is a value of the
estimator $T$; that is, $t = t(x_1, \dots, x_n)$. Thus, our loss depends on the sample
$X_1, \dots, X_n$. We cannot hope to make the loss small for every possible sample,
but we can try to make the loss small on the average. Hence, if we alter our
objective of picking that estimator that makes the loss small to picking that
estimator that makes the average loss small, we can remove the dependence of
the loss on the sample $X_1, \dots, X_n$. This notion is embodied in the following
definition.

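Definition 12 Risk function For a given loss function $\ell(\cdot\,; \cdot)$, the risk
function, denoted by $\mathcal{R}_t(\theta)$, of an estimator $T = t(X_1, \dots, X_n)$ is defined
to be

\[ \mathcal{R}_t(\theta) = \mathcal{E}_\theta[\ell(T; \theta)]. \qquad (10) \]

////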

The risk function is the average loss. The expectation in Eq. (10) can be
taken in two ways. For example, if the density $f(x; \theta)$ from which we sampled
is a probability density function, then

\[ \mathcal{E}_\theta[\ell(T; \theta)] = \mathcal{E}_\theta[\ell(t(X_1, \dots, X_n); \theta)]
   = \int \cdots \int \ell(t(x_1, \dots, x_n); \theta) \prod_{i=1}^n f(x_i; \theta)\, dx_i. \]

Or we can consider the random variable $T$ and the density of $T$. We get

\[ \mathcal{E}_\theta[\ell(T; \theta)] = \int \ell(t; \theta) f_T(t)\, dt, \]

where $f_T(t)$ is the density of the estimator $T$. In either case, the expectation
averages out the values of $x_1, \dots, x_n$.


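EXAMPLE 17 Consider the same loss functions given in Example 16. The
corresponding risks are given by:

(i) $\mathcal{E}_\theta[[T - \tau(\theta)]^2]$, our familiar mean-squared error.

(ii) $\mathcal{E}_\theta[\,|T - \tau(\theta)|\,]$, the mean absolute error.

(iii) $A \cdot P_\theta[\,|T - \tau(\theta)| > \varepsilon\,]$.

(iv) $\rho(\theta)\, \mathcal{E}_\theta[\,|T - \tau(\theta)|^r\,]$.

////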
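
To make the risks of Example 17 concrete, the sketch below computes them exactly for one simple case. It is an illustration under stated assumptions rather than anything taken from the text: the sample is Bernoulli with parameter p, the estimator is the sample mean, and the values n = 10, p = 0.3, A = 2, and eps = 0.15 are arbitrary. Because the number of successes is binomial, the expectation can be taken over the distribution of the estimator itself, which is the second of the two ways of taking the expectation in Eq. (10).

```python
from math import comb


def exact_risk(loss, n, p):
    """Risk of the sample mean for n Bernoulli(p) observations.

    Averages loss(k / n, p) over k = number of successes, which is Binomial(n, p).
    """
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        total += loss(k / n, p) * prob_k
    return total


def squared_error(t, tau):
    return (t - tau) ** 2


def absolute_error(t, tau):
    return abs(t - tau)


def threshold(t, tau, A=2.0, eps=0.15):
    # A and eps are illustrative values, not taken from the text.
    return A if abs(t - tau) > eps else 0.0


n, p = 10, 0.3
print("mean-squared error :", exact_risk(squared_error, n, p))
print("mean absolute error:", exact_risk(absolute_error, n, p))
print("threshold risk     :", exact_risk(threshold, n, p))
```

For this estimator the first risk agrees with the familiar value p(1 - p)/n, which gives a quick check on the enumeration.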

Our object now is to select an estimator that makes the average loss (risk)
small and ideally select an estimator that has the smallest risk. To help meet
this objective, we use the concept of admissible estimators.

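Definition 13 Admissible estimator For two estimators $T_1 = t_1(X_1, \dots, X_n)$
and $T_2 = t_2(X_1, \dots, X_n)$, estimator $t_1$ is defined to be a
better estimator than $t_2$ if and only if

\[ \mathcal{R}_{t_1}(\theta) \le \mathcal{R}_{t_2}(\theta) \quad \text{for all } \theta \text{ in } \Theta \]

and

\[ \mathcal{R}_{t_1}(\theta) < \mathcal{R}_{t_2}(\theta) \quad \text{for at least one } \theta \text{ in } \Theta. \]

An estimator $T = t(X_1, \dots, X_n)$ is defined to be admissible if and only if
there is no better estimator. ////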

In general, given two estimators $t_1$ and $t_2$, neither is better than the other;
that is, their respective risk functions, as functions of $\theta$, cross. We observed
this same phenomenon when we studied the mean-squared error. Here, as
there, there will not, in general, exist an estimator with uniformly smallest risk.
The problem is the dependence of the risk function on $\theta$. What we might do
is average out $\theta$, just as we average out the dependence on $x_1, \dots, x_n$ when
going from the loss function to the risk function. The question then is: Just
how should $\theta$ be averaged out? We will consider just this problem in Sec. 7
on the Bayes estimators. Another way of removing the dependence of the risk
function on $\theta$ is to replace the risk function by its maximum value and compare
estimators by looking at their respective maximum risks, naturally preferring
that estimator with smallest maximum risk. Such an estimator is said to be
minimax.

Definition 14 Minimax An estimator $t^*$ is defined to be a minimax
estimator if and only if $\sup_\theta \mathcal{R}_{t^*}(\theta) \le \sup_\theta \mathcal{R}_t(\theta)$ for every estimator $t$. ////

Minimax estimators will be discussed in Sec. 7. 
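
Comparing estimators by their maximum risk can be sketched numerically. The example below is an illustration under assumptions that are not in the text: Bernoulli sampling with n = 16, squared-error loss, two estimators of p (the sample mean and a shrunken version chosen for illustration), and a supremum approximated by a maximum over a finite grid of p values. It compares only these two estimators and does not establish that either is minimax among all estimators.

```python
from math import comb, sqrt


def exact_risk(estimator, n, p):
    """Squared-error risk at p of estimator(k, n), where k ~ Binomial(n, p)."""
    return sum(
        (estimator(k, n) - p) ** 2 * comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )


def sample_mean(k, n):
    return k / n


def shrunk_mean(k, n):
    # Illustrative alternative: pulls the sample mean toward 1/2.
    return (k + sqrt(n) / 2) / (n + sqrt(n))


def max_risk(estimator, n, grid_size=201):
    """Approximate the supremum of the risk by a maximum over a grid on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(exact_risk(estimator, n, p) for p in grid)


n = 16
print("max risk of sample mean:", max_risk(sample_mean, n))
print("max risk of shrunk mean:", max_risk(shrunk_mean, n))
```

The risk functions of the two estimators cross: the sample mean has smaller risk for p near 0 or 1, while the shrunken estimator has the smaller maximum risk, so the maximum-risk comparison prefers the latter.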


4 SUFFICIENCY 

Prior to continuing our pursuit of finding best estimators, we introduce the
concept of sufficiency of statistics. In many of the estimation problems that
we will encounter, we will be able to summarize the information in the sample
$X_1, \dots, X_n$. That is, we will be able to find some function of the sample that
tells us just as much about $\theta$ as the sample itself. Such a function would be
sufficient for estimation purposes and accordingly is called a sufficient statistic.

Sufficient statistics are of interest in themselves, as well as being useful in 
statistical inference problems such as estimation or testing of hypotheses. 
Because the concept of sufficiency is widely applicable, possibly the notion 
should have been isolated in a chapter by itself rather than buried in this chapter 
on estimation. 

4.1 Sufficient Statistics 

Let $X_1, \dots, X_n$ be a random sample from some density, say $f(\cdot\,; \theta)$. We defined
a statistic to be a function of the sample; that is, a statistic is a function with
domain the range of values that $(X_1, \dots, X_n)$ can take on and counterdomain
the real numbers. A statistic $T = t(X_1, \dots, X_n)$ is also a random variable; it
condenses the $n$ random variables $X_1, X_2, \dots, X_n$ into a single random variable.
Such condensing is appealing since we would rather work with unidimensional
quantities than $n$-dimensional quantities. We shall be interested in seeing if we
lost any “information” by this condensing process. The condensing can also be
viewed another way. Let $\mathfrak{X}$ denote the range of values that $(X_1, \dots, X_n)$ can
assume. For example, if we sample from a Bernoulli distribution, then $\mathfrak{X}$ is a
collection of all $n$-dimensional vectors with components either 0 or 1; or if we
sample from a normal distribution, then $\mathfrak{X}$ is an $n$-dimensional euclidean space.
Now a statistic induces or defines a partition of $\mathfrak{X}$. (Recall that a partition of $\mathfrak{X}$
is a collection of mutually disjoint subsets of $\mathfrak{X}$ whose union is $\mathfrak{X}$.) Let
$t(\cdot, \dots, \cdot)$ be the function corresponding to the statistic $T = t(X_1, \dots, X_n)$.
The partition induced by $t(\cdot, \dots, \cdot)$ is brought about as follows: Let $t_0$ denote
any value of the function $t(\cdot, \dots, \cdot)$; that subset of $\mathfrak{X}$ consisting of all those
points $(x_1, \dots, x_n)$ for which $t(x_1, \dots, x_n) = t_0$ is one subset in the collection
of subsets which the partition comprises; the other subsets are similarly formed
by considering other values of $t(\cdot, \dots, \cdot)$. For example, if a sample of size 3
is selected from a Bernoulli distribution, then $\mathfrak{X}$ consists of eight points
(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1). Let
$t(x_1, x_2, x_3) = x_1 + x_2 + x_3$; then $t(\cdot, \cdot, \cdot)$ takes on the values 0, 1, 2, and 3.
The partition of $\mathfrak{X}$ induced by $t(\cdot, \cdot, \cdot)$ consists of the four subsets {(0, 0, 0)},
{(0, 0, 1), (0, 1, 0), (1, 0, 0)}, {(0, 1, 1), (1, 0, 1), (1, 1, 0)}, and {(1, 1, 1)}
corresponding, respectively, to the four values 0, 1, 2, and 3 of $t(\cdot, \cdot, \cdot)$. A statistic
then is really a condensation of $\mathfrak{X}$. In the above example, if we use the statistic
$t(\cdot, \cdot, \cdot)$, we have only four different values to worry about instead of the eight
different points of $\mathfrak{X}$.

Several different statistics can induce the same partition. In fact, if
$t(\cdot, \dots, \cdot)$ is a statistic, then any one-to-one function of $t$ has the same partition
as $t$. In the example above $t'(x_1, x_2, x_3) = 6(x_1 + x_2 + x_3)$, or even
$t''(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2$, induces the same partition as $t(x_1, x_2, x_3) =
x_1 + x_2 + x_3$. One of the reasons for using statistics is that they do condense
$\mathfrak{X}$, and if such is our only reason for using a statistic, then any two statistics with
the same partition are of the same utility. The important aspect of a statistic
is the partition of $\mathfrak{X}$ that it induces, not the values that it assumes.
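
The partition induced by a statistic can be enumerated directly for a Bernoulli sample of size 3. The sketch below is an illustration, not part of the original text: the helper name `induced_partition` is hypothetical, and the statistic x1*x2 + x3 is included only as a contrasting case. A partition is represented as a set of cells, so that two statistics can be compared without regard to the values they assign to each cell.

```python
from itertools import product


def induced_partition(statistic, n=3):
    """Group the points of {0, 1}^n by the value of the statistic.

    Returns the partition as a set of frozensets, discarding the labels,
    since only the grouping matters.
    """
    cells = {}
    for x in product((0, 1), repeat=n):
        cells.setdefault(statistic(*x), []).append(x)
    return {frozenset(cell) for cell in cells.values()}


def total(x1, x2, x3):
    return x1 + x2 + x3


def six_times_total(x1, x2, x3):
    return 6 * (x1 + x2 + x3)


def sum_of_squares(x1, x2, x3):
    return x1**2 + x2**2 + x3**2


def product_plus_last(x1, x2, x3):
    # Contrasting statistic, chosen for illustration.
    return x1 * x2 + x3


print(len(induced_partition(total)))
print(induced_partition(total) == induced_partition(six_times_total))
print(induced_partition(total) == induced_partition(sum_of_squares))
print(induced_partition(total) == induced_partition(product_plus_last))
```

The first three statistics take different values but produce the same four cells, whereas the contrasting statistic groups the eight points differently.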

A sufficient statistic is a particular kind of statistic. It is a statistic that
condenses $\mathfrak{X}$ in such a way that no “information about $\theta$” is lost. The only
information about the parameter $\theta$ in the density $f(\cdot\,; \theta)$ from which we sampled
is contained in the sample $X_1, \dots, X_n$; so, when we say that a statistic loses no
information, we mean that it contains all the information about $\theta$ that is
contained in the sample. We emphasize that the type of information of which we
are speaking is that information about $\theta$ contained in the sample given that we
know the form of the density; that is, we know the function $f(\cdot\,; \cdot)$ in $f(\cdot\,; \theta)$,
and the parameter $\theta$ is the only unknown. We are not speaking of information
in the sample that might be useful in checking the validity of our assumption
that the density does indeed have form $f(\cdot\,; \cdot)$.

Now we shall formalize the definition of a sufficient statistic; in fact, we 
shall give two definitions, namely, Definitions 15 and 16. It can be argued that 
the two definitions are equivalent, but we will not do it. 

Definition 15 Sufficient statistic Let $X_1, \dots, X_n$ be a random sample
from the density $f(\cdot\,; \theta)$, where $\theta$ may be a vector. A statistic $S =
s(X_1, \dots, X_n)$ is defined to be a sufficient statistic if and only if the
conditional distribution of $X_1, \dots, X_n$ given $S = s$ does not depend on $\theta$ for
any value $s$ of $S$. ////

Note that we use $S = s(X_1, \dots, X_n)$, instead of $T = t(X_1, \dots, X_n)$, to
denote a sufficient statistic. Some care is required in interpreting the
conditional distribution of $X_1, \dots, X_n$ given $S = s$, as Example 19 and the paragraph
preceding it demonstrate.

The definition says that a statistic $S = s(X_1, \dots, X_n)$ is sufficient if the
conditional distribution of the sample given the value of the statistic does not
depend on $\theta$. The idea is that if you know the value of the sufficient statistic,
then the sample values themselves are not needed and can tell you nothing more
about $\theta$, and this is true since the distribution of the sample given the sufficient
statistic does not depend on $\theta$. One cannot hope to learn anything about $\theta$ by
sampling from a distribution that does not depend on $\theta$.

EXAMPLE 18 Let $X_1, X_2, X_3$ be a sample of size 3 from the Bernoulli
distribution. Consider the two statistics $S = s(X_1, X_2, X_3) = X_1 + X_2 + X_3$
and $T = t(X_1, X_2, X_3) = X_1 X_2 + X_3$. We will show that
$s(\cdot, \cdot, \cdot)$ is sufficient and $t(\cdot, \cdot, \cdot)$ is not. The first column of Fig. 6 is $\mathfrak{X}$.

(x_1, x_2, x_3)   Values of S   Values of T   f_{X_1,X_2,X_3|S}   f_{X_1,X_2,X_3|T}
(0, 0, 0)         0             0             1                   (1 - p)/(1 + p)
(0, 0, 1)         1             1             1/3                 (1 - p)/(1 + 2p)
(0, 1, 0)         1             0             1/3                 p/(1 + p)
(1, 0, 0)         1             0             1/3                 p/(1 + p)
(0, 1, 1)         2             1             1/3                 p/(1 + 2p)
(1, 0, 1)         2             1             1/3                 p/(1 + 2p)
(1, 1, 0)         2             1             1/3                 p/(1 + 2p)
(1, 1, 1)         3             2             1                   1

FIGURE 6


The conditional densities given in the last two columns are routinely
calculated. For instance,

\[ f_{X_1,X_2,X_3 \mid S=1}(0, 1, 0 \mid 1) = P[X_1 = 0; X_2 = 1; X_3 = 0 \mid S = 1]
   = \frac{P[X_1 = 0; X_2 = 1; X_3 = 0; S = 1]}{P[S = 1]}
   = \frac{(1 - p)p(1 - p)}{\binom{3}{1} p(1 - p)^2} = \frac{1}{3} \]

and

\[ f_{X_1,X_2,X_3 \mid T=0}(0, 1, 0 \mid 0)
   = \frac{P[X_1 = 0; X_2 = 1; X_3 = 0; T = 0]}{P[T = 0]}
   = \frac{(1 - p)^2 p}{(1 - p)^3 + 2(1 - p)^2 p}
   = \frac{p}{1 - p + 2p} = \frac{p}{1 + p}. \]

The conditional distribution of the sample given the values of $S$ is
independent of $p$; so $S$ is a sufficient statistic; however, the conditional
distribution of the sample given the values of $T$ depends on $p$; so $T$ is not sufficient.
We might note that the statistic $T$ provides a greater condensation of $\mathfrak{X}$
than does $S$. A question that might be asked is: Is there a statistic which
provides greater condensation of $\mathfrak{X}$ than does $S$ which is sufficient as well?
The answer is “no” and can be verified by trying all possible partitions of
$\mathfrak{X}$ consisting of three or fewer subsets. ////
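
The conditional densities of Example 18 can be recomputed exactly for particular values of p. The sketch below is an illustration rather than part of the original text: the helper `conditional_table` is a hypothetical name, and the two values of p are arbitrary. It uses exact rational arithmetic, so equality of the conditional tables at two values of p is checked without rounding error. Agreement at two values does not by itself prove sufficiency, but disagreement does show that the conditional distribution depends on p.

```python
from fractions import Fraction
from itertools import product


def conditional_table(statistic, p):
    """Return {x: P[X = x | statistic(X) = statistic(x)]} for a Bernoulli(p) sample of size 3."""
    points = list(product((0, 1), repeat=3))
    prob = {x: p ** sum(x) * (1 - p) ** (3 - sum(x)) for x in points}
    totals = {}
    for x in points:
        totals[statistic(x)] = totals.get(statistic(x), 0) + prob[x]
    return {x: prob[x] / totals[statistic(x)] for x in points}


def S(x):
    return x[0] + x[1] + x[2]


def T(x):
    return x[0] * x[1] + x[2]


p1, p2 = Fraction(1, 4), Fraction(2, 3)  # arbitrary illustrative values
given_S_same = conditional_table(S, p1) == conditional_table(S, p2)
given_T_same = conditional_table(T, p1) == conditional_table(T, p2)
print("conditional on S unchanged:", given_S_same)
print("conditional on T unchanged:", given_T_same)
print("P[(0,1,0) | T = 0] at p = 1/4:", conditional_table(T, p1)[(0, 1, 0)])
```

At p = 1/4 the last line gives 1/5, in agreement with the entry p/(1 + p) in Fig. 6.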


In the case of sampling from a probability density function, the meaning
of the term “the conditional distribution of $X_1, \dots, X_n$ given $S = s$” that
appears in Definition 15 may not be obvious since then $P[S = s] = 0$. We can
give two interpretations. The first deals with the joint cumulative distribution
function and uses Eq. (9) of Subsec. 3.3 in Chap. IV; that is, to show that
$S = s(X_1, \dots, X_n)$ is sufficient, one shows that $P[X_1 \le x_1; \dots; X_n \le x_n \mid S = s]$
is independent of $\theta$, where $P[X_1 \le x_1; \dots; X_n \le x_n \mid S = s]$ is defined as in
Eq. (9) of Chap. IV. The second interpretation is obtained if a one-to-one
transformation of $X_1, X_2, \dots, X_n$ to, say, $S, Y_2, \dots, Y_n$ is made, and then it is
demonstrated that the density of $Y_2, \dots, Y_n$ given $S = s$ is independent of $\theta$. If
the distribution of $Y_2, \dots, Y_n$ given $S = s$ is independent of $\theta$, then the
distribution of $S, Y_2, \dots, Y_n$ given $S = s$ is independent of $\theta$, and hence the distribution
of $X_1, X_2, \dots, X_n$ given $S = s$ is independent of $\theta$. These two interpretations
are illustrated in the following example.


EXAMPLE 19 Let $X_1, \dots, X_n$ be a random sample from $f(\cdot\,; \theta) = \phi_{\theta,1}(\cdot)$;
that is, $X_1, \dots, X_n$ is a random sample from a normal distribution with
mean $\theta$ and variance unity. In order to expedite calculations, we take
$n = 2$. Let us argue that $S = s(X_1, X_2) = X_1 + X_2$ is sufficient using
the second interpretation above. The transformation of $(X_1, X_2)$ to
$(S, Y_2)$, where $S = X_1 + X_2$ and $Y_2 = X_2 - X_1$, is one-to-one; so it
suffices to show that $f_{Y_2 \mid S}(y_2 \mid s)$ is independent of $\theta$. Now

\[ f_{Y_2 \mid S}(y_2 \mid s) = \frac{f_{Y_2,S}(y_2, s)}{f_S(s)} = \frac{f_{Y_2}(y_2) f_S(s)}{f_S(s)} = f_{Y_2}(y_2) \]

(using the independence of $X_1 + X_2$ and $X_2 - X_1$ that was proved in
Theorem 8 of Chap. VI), but

\[ f_{Y_2}(y_2) = \frac{1}{\sqrt{2\pi}\sqrt{2}}\, e^{-\frac{1}{2}(y_2^2/2)}, \]

since $Y_2 \sim N(0, 2)$, which is independent of $\theta$.

The necessary calculations for the first interpretation above are
less simple. We must show that $P[X_1 \le x_1; X_2 \le x_2 \mid S = s]$ is
independent of $\theta$. According to Eq. (9) of Chap. IV,

\[ P[X_1 \le x_1; X_2 \le x_2 \mid S = s] = \lim_{h \to 0} P[X_1 \le x_1; X_2 \le x_2 \mid s - h < S < s + h]. \]

Without loss of generality, assume that $x_1 \le x_2$. We have the following
three cases to consider: (i) $s \le x_1$, (ii) $x_1 < s \le x_2$, and (iii) $x_2 < s$.
$P[X_1 \le x_1; X_2 \le x_2 \mid S = s]$ is clearly 0 (and hence independent of $\theta$) for
case (iii). Let us consider (i). [Case (ii) is similar.]

\[ P[X_1 \le x_1; X_2 \le x_2 \mid S = s]
   = \lim_{h \to 0} P[X_1 \le x_1; X_2 \le x_2 \mid s - h < S < s + h] \]
\[ = \frac{\lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < S < s + h]}
          {\lim_{h \to 0} \frac{1}{2h} P[s - h < S < s + h]}
   = \frac{\lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < S < s + h]}{f_S(s)}. \]

Note that (see Fig. 7)

\[ \lim_{h \to 0} \frac{1}{2h} \int_{s+h-x_2}^{x_1} \int_{s-h-u}^{s+h-u} f_{X_1}(u) f_{X_2}(v)\, dv\, du \]
\[ \le \lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < X_1 + X_2 < s + h] \]
\[ \le \lim_{h \to 0} \frac{1}{2h} \int_{s-h-x_2}^{x_1} \int_{s-h-u}^{s+h-u} f_{X_1}(u) f_{X_2}(v)\, dv\, du, \]

FIGURE 7

and hence

\[ \lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < X_1 + X_2 < s + h]
   = \int_{s-x_2}^{x_1} f_{X_1}(u) f_{X_2}(s - u)\, du. \]

Finally, then,

\[ P[X_1 \le x_1; X_2 \le x_2 \mid S = s]
   = \frac{\int_{s-x_2}^{x_1} f_{X_1}(u) f_{X_2}(s - u)\, du}{f_S(s)} \]
\[ = \frac{\int_{s-x_2}^{x_1} (1/2\pi) \exp\{-\tfrac{1}{2}[u^2 - 2u\theta + \theta^2 + (s - u)^2 - 2(s - u)\theta + \theta^2]\}\, du}
          {(1/\sqrt{2\pi}\sqrt{2}) \exp\{-\tfrac{1}{2}(s - 2\theta)^2/2\}} \]
\[ = \frac{(1/\sqrt{2\pi}) \int_{s-x_2}^{x_1} \exp\{-\tfrac{1}{2}[u^2 + (s - u)^2]\}\, du}
          {(1/\sqrt{2}) \exp\{-\tfrac{1}{2}(s^2/2)\}}, \]

which is independent of $\theta$. ////

Definition 15 of a sufficient statistic is not very workable. First, it does
not tell us which statistic is likely to be sufficient, and, second, it requires us to 
derive a conditional distribution which may not be easy, especially for con¬ 
tinuous random variables. In Subsec. 4.2 below, we will present a criterion 
that may aid us in finding sufficient statistics. 

Although we will not so argue, the following definition is equivalent to 
Definition 15. 

Definition 16 Sufficient statistic Let $X_1, \dots, X_n$ be a random sample
from the density $f(\cdot\,; \theta)$. A statistic $S = s(X_1, \dots, X_n)$ is defined to be a
sufficient statistic if and only if the conditional distribution of $T$ given $S$
does not depend on $\theta$ for any statistic $T = t(X_1, \dots, X_n)$. ////

Definition 16 is particularly useful in showing that a particular statistic
is not sufficient. For instance, to prove that a statistic $T' = t'(X_1, \dots, X_n)$ is
not sufficient, one needs only to find another statistic $T = t(X_1, \dots, X_n)$ for
which the conditional distribution of $T$ given $T'$ depends on $\theta$.

For some problems, no single sufficient statistic exists. However, there 
will always exist jointly sufficient statistics. 

Definition 17 Jointly sufficient statistics Let $X_1, \dots, X_n$ be a random
sample from the density $f(\cdot\,; \theta)$. The statistics $S_1, \dots, S_r$ are defined to be
jointly sufficient if and only if the conditional distribution of $X_1, \dots, X_n$
given $S_1 = s_1, \dots, S_r = s_r$ does not depend on $\theta$. ////

The sample $X_1, \dots, X_n$ itself is always jointly sufficient since the
conditional distribution of the sample given the sample does not depend on $\theta$. Also,
the order statistics $Y_1, \dots, Y_n$ are jointly sufficient for random sampling. If the
order statistics are given, say, by $(Y_1 = y_1, \dots, Y_n = y_n)$, then the only values
that can be taken on by $(X_1, \dots, X_n)$ are the permutations of $y_1, \dots, y_n$. Since
the sampling is random, each of the $n!$ permutations is equally likely. So, given
the values of the order statistics, the probability that the sample equals a
particular permutation of these given values of the order statistics is $1/n!$, which is
independent of $\theta$. (Sufficiency of the order statistics also follows from Theorem
5 below.)

If we recall that the important aspect of a statistic or set of statistics is the
partition of $\mathfrak{X}$ that it induces, and not the values that it takes on, then the
validity of the following theorem is evident.

Theorem 3 If $S_1 = s_1(X_1, \dots, X_n), \dots, S_r = s_r(X_1, \dots, X_n)$ is a set of
jointly sufficient statistics, then any set of one-to-one functions, or
transformations, of $S_1, \dots, S_r$ is also jointly sufficient. ////

For example, if $\sum X_i$ and $\sum X_i^2$ are jointly sufficient, then $\bar{X}$ and
$\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$ are also jointly sufficient. Note, however, that
$\bar{X}^2$ and $\sum (X_i - \bar{X})^2$ may not be jointly sufficient since they are not one-to-one
functions of $\sum X_i$ and $\sum X_i^2$.

We note again that the parameter $\theta$ that appears in any of the above three
definitions of sufficient statistics can be a vector.


4.2 Factorization Criterion 

The concept of sufficiency of statistics was defined in Definitions 15 to 17 above. 
In many cases, a relatively easy criterion for examining a statistic or set of 
statistics for sufficiency has been developed. This is given in the next two 
theorems, the proofs of which are omitted. 

Theorem 4 Factorization theorem (single sufficient statistic) Let $X_1,
X_2, \dots, X_n$ be a random sample of size $n$ from the density $f(\cdot\,; \theta)$, where
the parameter $\theta$ may be a vector. A statistic $S = s(X_1, \dots, X_n)$ is sufficient
if and only if the joint density of $X_1, \dots, X_n$, which is $\prod_{i=1}^n f(x_i; \theta)$, factors as

\[ f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta) = g(s(x_1, \dots, x_n); \theta)\, h(x_1, \dots, x_n)
   = g(s; \theta)\, h(x_1, \dots, x_n), \qquad (11) \]

where the function $h(x_1, \dots, x_n)$ is nonnegative and does not involve the
parameter $\theta$ and the function $g(s(x_1, \dots, x_n); \theta)$ is nonnegative and
depends on $x_1, \dots, x_n$ only through the function $s(\cdot, \dots, \cdot)$. ////

Theorem 5 Factorization theorem (jointly sufficient statistics) Let $X_1,
X_2, \dots, X_n$ be a random sample of size $n$ from the density $f(\cdot\,; \theta)$, where
the parameter $\theta$ may be a vector. A set of statistics $S_1 = s_1(X_1, \dots, X_n),
\dots, S_r = s_r(X_1, \dots, X_n)$ is jointly sufficient if and only if the joint density
of $X_1, \dots, X_n$ can be factored as

\[ f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta)
   = g(s_1(x_1, \dots, x_n), \dots, s_r(x_1, \dots, x_n); \theta)\, h(x_1, \dots, x_n)
   = g(s_1, \dots, s_r; \theta)\, h(x_1, \dots, x_n), \qquad (12) \]

where the function $h(x_1, \dots, x_n)$ is nonnegative and does not involve the
parameter $\theta$ and the function $g(s_1, \dots, s_r; \theta)$ is nonnegative and depends
on $x_1, \dots, x_n$ only through the functions $s_1(\cdot, \dots, \cdot), \dots, s_r(\cdot, \dots, \cdot)$. ////

Note that, according to Theorem 3, there are many possible sets of suffi¬ 
cient statistics. The above two theorems give us a relatively easy method for 
judging whether a certain statistic is sufficient or a set of statistics is jointly 
sufficient. However, the method is not the complete answer since a particular 
statistic may be sufficient yet the user may not be clever enough to factor the 
joint density as in Eq. (11) or (12). The theorems may also be useful in discover¬ 
ing sufficient statistics. 

Actually, the result of either of the above factorization theorems is
intuitively evident if one notes the following: If the joint density factors as
indicated in, say, Eq. (12), then the likelihood function is proportional to
$g(s_1, \dots, s_r; \theta)$, which depends on the observations $x_1, \dots, x_n$ only through
$s_1, \dots, s_r$ [the likelihood function is viewed as a function of $\theta$, so $h(x_1, \dots, x_n)$
is just a proportionality constant], which means that the information about
$\theta$ that the likelihood function contains is embodied in the statistics
$s_1(\cdot, \dots, \cdot), \dots, s_r(\cdot, \dots, \cdot)$.

Before giving several examples, we remark that the function $h(\cdot, \dots, \cdot)$
appearing in either Eq. (11) or (12) may be constant.


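EXAMPLE 20 Let $X_1, \dots, X_n$ be a random sample from the Bernoulli density
with parameter $\theta$; that is,

\[ f(x; \theta) = \theta^x (1 - \theta)^{1-x} I_{\{0,1\}}(x) \quad \text{and} \quad 0 \le \theta \le 1. \]

Then

\[ \prod_{i=1}^n f(x_i; \theta) = \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} I_{\{0,1\}}(x_i)
   = \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} \prod_{i=1}^n I_{\{0,1\}}(x_i). \]

If we take $\theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}$ as $g(s(x_1, \dots, x_n); \theta)$ and $\prod_{i=1}^n I_{\{0,1\}}(x_i)$ as
$h(x_1, \dots, x_n)$ and set $s(x_1, \dots, x_n) = \sum x_i$, then the joint density of
$X_1, \dots, X_n$ factors as in Eq. (11), indicating that $S = s(X_1, \dots, X_n) = \sum X_i$
is a sufficient statistic. ////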
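
The factorization of Example 20 can be checked numerically for a particular sample. The sketch below is an illustration under stated assumptions: the function names `joint_density`, `g`, and `h`, the sample values, and the parameter values are hypothetical choices. It compares the joint density computed term by term with the product of the two factors.

```python
def joint_density(x, theta):
    """Joint Bernoulli density of the sample x, computed term by term."""
    value = 1.0
    for xi in x:
        if xi not in (0, 1):
            return 0.0
        value *= theta**xi * (1 - theta) ** (1 - xi)
    return value


def g(s, n, theta):
    """Factor that involves theta and depends on the sample only through s = sum of x."""
    return theta**s * (1 - theta) ** (n - s)


def h(x):
    """Factor free of theta: product of the indicators of {0, 1}."""
    return 1.0 if all(xi in (0, 1) for xi in x) else 0.0


x = (1, 0, 1, 1, 0)   # hypothetical sample
y = (0, 1, 1, 0, 1)   # a different sample with the same sum
for theta in (0.2, 0.5, 0.9):
    factored = g(sum(x), len(x), theta) * h(x)
    print(theta, joint_density(x, theta), factored, joint_density(y, theta))
```

Since x and y have the same sum and the same value of h, their joint densities agree for every value of the parameter, which is what the factorization implies.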

EXAMPLE 21 Let $X_1, \dots, X_n$ be a random sample from the normal density
with mean $\mu$ and variance unity. Here the parameter is denoted by $\mu$
instead of $\theta$. The joint density is given by

\[ f_{X_1, \dots, X_n}(x_1, \dots, x_n; \mu) = \prod_{i=1}^n \phi_{\mu,1}(x_i)
   = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\Bigl[-\frac{1}{2}(x_i - \mu)^2\Bigr]
   = \frac{1}{(2\pi)^{n/2}} \exp\Bigl[-\frac{1}{2}\Bigl(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\Bigr)\Bigr]. \]

If we take $h(x_1, \dots, x_n) = [1/(2\pi)^{n/2}] \exp(-\frac{1}{2} \sum x_i^2)$ and $g(s(x_1, \dots, x_n); \mu) =
\exp[\mu \sum x_i - (n/2)\mu^2]$, then the joint density has been factored as in
Eq. (11) with $s(x_1, \dots, x_n) = \sum x_i$; hence $\sum X_i$ is a sufficient statistic.
(Recall that $\bar{X}_n$ is also sufficient since any one-to-one function of a
sufficient statistic is also sufficient.) ////


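EXAMPLE 22 Let $X_1, \dots, X_n$ be a random sample from the normal density
$\phi_{\mu,\sigma^2}(\cdot)$. Here the parameter $\theta$ is a vector of two components; that is,
$\theta = (\mu, \sigma)$. The joint density of $X_1, \dots, X_n$ is given by

\[ \prod_{i=1}^n \phi_{\mu,\sigma^2}(x_i)
   = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl[-\frac{1}{2}\Bigl(\frac{x_i - \mu}{\sigma}\Bigr)^2\Bigr]
   = \frac{1}{(2\pi)^{n/2}}\, \sigma^{-n} \exp\Bigl[-\frac{1}{2} \sum \Bigl(\frac{x_i - \mu}{\sigma}\Bigr)^2\Bigr] \]
\[ = \frac{1}{(2\pi)^{n/2}}\, \sigma^{-n} \exp\Bigl[-\frac{1}{2\sigma^2}\Bigl(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\Bigr)\Bigr]; \]

so the joint density itself depends on the observations $x_1, \dots, x_n$ only
through the statistics $s_1(x_1, \dots, x_n) = \sum x_i$ and $s_2(x_1, \dots, x_n) = \sum x_i^2$;
that is, the joint density is factored as in Eq. (12) with $h(x_1, \dots, x_n) = 1$.
Hence, $\sum X_i$ and $\sum X_i^2$ are jointly sufficient. It can be shown that
$\bar{X}_n$ and $S^2 = [1/(n - 1)] \sum (X_i - \bar{X})^2$ are one-to-one functions of $\sum X_i$
and $\sum X_i^2$; so $\bar{X}_n$ and $S^2$ are also jointly sufficient. ////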

EXAMPLE 23 Let $X_1, \dots, X_n$ be a random sample from a uniform
distribution over the interval $[\theta_1, \theta_2]$. The joint density of $X_1, \dots, X_n$ is given by

\[ f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta_1, \theta_2)
   = \prod_{i=1}^n \frac{1}{\theta_2 - \theta_1} I_{[\theta_1, \theta_2]}(x_i)
   = \frac{1}{(\theta_2 - \theta_1)^n} \prod_{i=1}^n I_{[\theta_1, \theta_2]}(x_i)
   = \frac{1}{(\theta_2 - \theta_1)^n} I_{[\theta_1, y_n]}(y_1)\, I_{[y_1, \theta_2]}(y_n), \]

where

\[ y_1 = \min[x_1, \dots, x_n] \quad \text{and} \quad y_n = \max[x_1, \dots, x_n]. \]

The joint density itself depends on $x_1, \dots, x_n$ only through $y_1$ and $y_n$;
hence it factors as in Eq. (12) with $h(x_1, \dots, x_n) = 1$. The statistics $Y_1$
and $Y_n$ are jointly sufficient. Note that if we take $\theta_1 = \theta$ and $\theta_2 = \theta + 1$,
then $Y_1$ and $Y_n$ are still jointly sufficient. However, if we take $\theta_1 = 0$ and
$\theta_2 = \theta$, then our factorization can be expressed as

\[ f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta) = \frac{1}{\theta^n} \prod_{i=1}^n I_{[0, \theta]}(x_i)
   = \frac{1}{\theta^n} I_{[0, \theta]}(y_n)\, I_{[0, y_n]}(y_1). \]

Taking $g(s(x_1, \dots, x_n); \theta) = (1/\theta^n) I_{[0, \theta]}(y_n)$ and $h(x_1, \dots, x_n) = I_{[0, y_n]}(y_1)$,
we see that $Y_n$ alone is sufficient. ////

The factorization criterion of Eqs. (11) and (12) is primarily useful in 
showing that a statistic or set of statistics is sufficient. It is not useful in 
proving that a statistic or set of statistics is not sufficient. The fact that we
cannot factor the joint density does not mean that it cannot be factored; it could 
be that we are just not able to find a correct factorization. 

If we go back and look through our examples on maximum-likelihood 
estimators (see Examples 5 to 8), we will see that all the maximum-likelihood 
estimators that appear there depend on the sample $X_1, \dots, X_n$ through sufficient
statistics. This is not something that is characteristic of the relatively simple 
examples we had given but something that is true in general. 

Theorem 6 A maximum-likelihood estimator or set of maximum- 
likelihood estimators depends on the sample through any set of jointly 
sufficient statistics. 

PROOF If $S_1 = s_1(X_1, \dots, X_n), \dots, S_k = s_k(X_1, \dots, X_n)$ are jointly
sufficient, then the likelihood function can be written as

\[ L(\theta; x_1, \dots, x_n) = \prod_{i=1}^n f(x_i; \theta)
   = g(s_1(x_1, \dots, x_n), \dots, s_k(x_1, \dots, x_n); \theta)\, h(x_1, \dots, x_n). \]

As a function of $\theta$, $L(\theta; x_1, \dots, x_n)$ will have its maximum at the same
place that $g(s_1, \dots, s_k; \theta)$ has its maximum, but the place where $g$ attains
its maximum can depend on $x_1, \dots, x_n$ only through $s_1, \dots, s_k$ since $g$
does. ////

We might note that method-of-moment estimators may not be functions 
of sufficient statistics. See Examples 4 and 23. 


4.3 Minimal Sufficient Statistics 

When we introduced the concept of sufficiency, we said that our objective was to 
condense the data without losing any information about the parameter. We 
have seen that there is more than one set of sufficient statistics. For example, 
in sampling from a normal distribution with both the mean and variance un¬ 
known, we have noted three sets of jointly sufficient statistics, namely, the sample 
$X_1, \dots, X_n$ itself, the order statistics $Y_1, \dots, Y_n$, and $\bar{X}$ and $S^2$. We naturally
prefer the jointly sufficient set $\bar{X}$ and $S^2$ since they condense the data more than
either of the other two. (Note that the order statistics do condense the data.)
The question that we might ask is: Does there exist a set of sufficient statistics
that condenses the data more than $\bar{X}$ and $S^2$? The answer is that there does
not, but we will not develop the necessary tools to establish this answer. The 
notion that we are alluding to is that of a minimum set of sufficient statistics, 
which we label minimal sufficient statistics. 

We noted earlier that corresponding to any statistic is the partition of $\mathfrak{X}$
that it induces. The same is true of a set of statistics; a set of statistics induces
a partition of $\mathfrak{X}$. Loosely speaking, the condensation of the data that a statistic
or set of statistics exhibits can be measured by the number of subsets in the
partition induced by that statistic or set of statistics. If a set of statistics has
fewer subsets in its induced partition than does the induced partition of another
set of statistics, then we say that the first statistic condenses the data more than
the latter. Still loosely speaking, a minimal sufficient set of statistics is then a
sufficient set of statistics that has fewer subsets in its partition than the induced
partition of any other set of sufficient statistics. So a set of sufficient statistics
is minimal if no other set of sufficient statistics condenses the data more. A
formal definition is the following.

Definition 18 Minimal sufficient statistic A set of jointly sufficient 
statistics is defined to be minimal sufficient if and only if it is a function of 
every other set of sufficient statistics. //// 

Like many definitions, Definition 18 is of little use in finding minimal 
sufficient statistics. A technique for finding minimal sufficient statistics has 
been devised by Lehmann and Scheffé [19], but we will not present it. If the
joint density is properly factored, the factorization criterion will give us minimal 
sufficient statistics. All the sets of sufficient statistics found in Examples 20 
to 23 are minimal. 

4.4 Exponential Family 

Many of the parametric families of densities that we have considered are 
members of what is called the exponential class, or exponential family, not to be 
confused with the negative exponential family of densities which is a special case. 

Definition 19 Exponential family of densities A one-parameter family
($\theta$ is unidimensional) of densities $f(\cdot\,; \theta)$ that can be expressed as

\[ f(x; \theta) = a(\theta)\, b(x) \exp[c(\theta)\, d(x)] \qquad (13) \]

for $-\infty < x < \infty$, for all $\theta \in \Theta$, and for a suitable choice of functions
$a(\cdot)$, $b(\cdot)$, $c(\cdot)$, and $d(\cdot)$ is defined to belong to the exponential family or
exponential class. ////


EXAMPLE 24 If $f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$, then $f(x; \theta)$ belongs to the
exponential family for $a(\theta) = \theta$, $b(x) = I_{(0,\infty)}(x)$, $c(\theta) = -\theta$, and $d(x) = x$ in
Eq. (13). ////


EXAMPLE 25 If $f(x; \theta) = f(x; \lambda)$ is the Poisson density, then

\[ f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!} I_{\{0,1,\dots\}}(x)
   = e^{-\lambda} \Bigl(\frac{1}{x!} I_{\{0,1,\dots\}}(x)\Bigr) \exp(x \log \lambda). \]

In Eq. (13), we can take $a(\lambda) = e^{-\lambda}$, $b(x) = (1/x!)\, I_{\{0,1,\dots\}}(x)$, $c(\lambda) = \log \lambda$,
and $d(x) = x$; so $f(x; \lambda)$ belongs to the exponential family. ////


Remark If $f(x; \theta) = a(\theta)\, b(x) \exp[c(\theta)\, d(x)]$, then

\[ \prod_{i=1}^n f(x_i; \theta) = a^n(\theta) \Bigl[\prod_{i=1}^n b(x_i)\Bigr] \exp\Bigl[c(\theta) \sum_{i=1}^n d(x_i)\Bigr], \]

and hence by the factorization criterion $\sum d(X_i)$ is a sufficient statistic. ////


The above remark shows that, under random sampling, if a density belongs 
to the one-parameter exponential family, then there is a sufficient statistic. In 
fact, it can be shown that the sufficient statistic so obtained is minimal. 

The one-parameter exponential family can be generalized to the k-parameter
exponential family.

Definition 20 k-parameter exponential family A family of densities
$f(\cdot; \theta_1, \ldots, \theta_k)$ that can be expressed as

$$f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp\left[ \sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k)\, d_j(x) \right] \tag{14}$$

for a suitable choice of functions $a(\cdot, \ldots, \cdot)$, $b(\cdot)$, $c_j(\cdot, \ldots, \cdot)$, and $d_j(\cdot)$,
$j = 1, \ldots, k$, is defined to belong to the exponential family. ////

In Definition 20, note that the number of terms in the sum of the exponent 
is k, which is also the dimension of the parameter. 


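EXAMPLE 26 If $f(x; \theta_1, \theta_2) = \phi_{\mu,\sigma^2}(x)$, where $(\theta_1, \theta_2) = (\mu, \sigma)$, then
$f(x; \theta_1, \theta_2)$ belongs to the exponential family.

$$f(x; \theta_1, \theta_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\mu^2}{2\sigma^2} \right) \exp\left( -\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x \right).$$

Take $a(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \cdot \frac{\mu^2}{\sigma^2} \right)$, $b(x) = 1$, $c_1(\mu, \sigma) = -1/2\sigma^2$,
$c_2(\mu, \sigma) = \mu/\sigma^2$, $d_1(x) = x^2$, and $d_2(x) = x$ to show that $\phi_{\mu,\sigma^2}(x)$ can be
expressed as in Eq. (14). ////


EXAMPLE 27 If

$$f(x; \theta_1, \theta_2) = \frac{1}{B(\theta_1, \theta_2)} x^{\theta_1 - 1} (1 - x)^{\theta_2 - 1} I_{(0,1)}(x),$$

then

$$f(x; \theta_1, \theta_2) = \frac{1}{B(\theta_1, \theta_2)} I_{(0,1)}(x) \exp[(\theta_1 - 1) \log x + (\theta_2 - 1) \log(1 - x)];$$

so $f(x; \theta_1, \theta_2)$ belongs to the exponential family with $a(\theta_1, \theta_2) =
1/B(\theta_1, \theta_2)$, $b(x) = I_{(0,1)}(x)$, $c_1(\theta_1, \theta_2) = \theta_1 - 1$, $c_2(\theta_1, \theta_2) = \theta_2 - 1$, $d_1(x) =
\log x$, and $d_2(x) = \log(1 - x)$. ////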

Remark If $f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp\left[ \sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k)\, d_j(x) \right]$,
then, under random sampling,

$$\prod_{i=1}^{n} f(x_i; \theta_1, \ldots, \theta_k) = a^n(\theta_1, \ldots, \theta_k) \left[ \prod_{i=1}^{n} b(x_i) \right] \exp\left[ \sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k) \sum_{i=1}^{n} d_j(x_i) \right],$$

and so by the factorization criterion $\sum_{i=1}^{n} d_1(X_i), \ldots, \sum_{i=1}^{n} d_k(X_i)$ is a set of
jointly sufficient statistics. $\sum d_1(X_i), \ldots, \sum d_k(X_i)$ are in fact minimal
sufficient statistics. ////


EXAMPLE 28 From Example 27, we see that $\sum_{i=1}^{n} \log X_i$ and $\sum_{i=1}^{n} \log(1 - X_i)$
are jointly minimal sufficient when sampling from a beta density. ////
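
To make Example 28 concrete, the sketch below computes the beta log-likelihood in two ways: observation by observation, and from the sample size together with the two statistics $\sum \log x_i$ and $\sum \log(1 - x_i)$ alone. The helper names are hypothetical, and the data passed in are assumed to lie strictly inside (0, 1).

```python
import math

def beta_sufficient_stats(xs):
    """Return (sum log x_i, sum log(1 - x_i)) for observations in (0, 1)."""
    return (sum(math.log(x) for x in xs),
            sum(math.log(1.0 - x) for x in xs))

def log_beta_fn(t1, t2):
    """log B(t1, t2), computed through the log-gamma function."""
    return math.lgamma(t1) + math.lgamma(t2) - math.lgamma(t1 + t2)

def loglik_direct(xs, t1, t2):
    """Log-likelihood summed one observation at a time."""
    return sum((t1 - 1) * math.log(x) + (t2 - 1) * math.log(1 - x)
               - log_beta_fn(t1, t2) for x in xs)

def loglik_from_stats(n, s1, s2, t1, t2):
    """Log-likelihood computed from n and the two sufficient statistics only."""
    return -n * log_beta_fn(t1, t2) + (t1 - 1) * s1 + (t2 - 1) * s2
```

Because the two functions agree for every parameter value, the likelihood depends on the sample only through the pair of sums, which is what joint sufficiency asserts.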

Our main use of the exponential family will not be in finding sufficient 
statistics, but it will be in showing that the sufficient statistics are complete, a 
concept that is useful in obtaining “best” estimators. This concept will be 
defined in Sec. 5. 

Lest one get the impression that all parametric families belong to the
exponential family, we remark that a family of uniform densities does not
belong to the exponential family. In fact, any family of densities for which the
range of the values where the density is positive depends on the parameter $\theta$
does not belong to the exponential class.

5 UNBIASED ESTIMATION

Since estimators with uniformly minimum mean-squared error rarely exist, a
reasonable procedure is to restrict the class of estimating functions and look
for estimators with uniformly minimum mean-squared error within the restricted
class. One way of restricting the class of estimating functions would be to
consider only unbiased estimators and then among the class of unbiased
estimators search for an estimator with minimum mean-squared error.
Consideration of unbiased estimators and the problem of finding one with uniformly
minimum mean-squared error are to be the subjects of this section.

According to Eq. (9) the mean-squared error of an estimator $T$ of $\tau(\theta)$
can be written as

$$\mathcal{E}_\theta[[T - \tau(\theta)]^2] = \operatorname{var}_\theta[T] + \{\tau(\theta) - \mathcal{E}_\theta[T]\}^2,$$

and if $T$ is an unbiased estimator of $\tau(\theta)$, then $\mathcal{E}_\theta[T] = \tau(\theta)$, and so
$\mathcal{E}_\theta[[T - \tau(\theta)]^2] = \operatorname{var}_\theta[T]$. Hence, seeking an estimator with uniformly
minimum mean-squared error among unbiased estimators is tantamount to
seeking an estimator with uniformly minimum variance among unbiased
estimators.

Definition 21 Uniformly minimum-variance unbiased estimator (UMVUE)
Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$. An estimator
$T^* = t^*(X_1, \ldots, X_n)$ of $\tau(\theta)$ is defined to be a uniformly minimum-variance
unbiased estimator of $\tau(\theta)$ if and only if (i) $\mathcal{E}_\theta[T^*] = \tau(\theta)$, that is, $T^*$ is
unbiased, and (ii) $\operatorname{var}_\theta[T^*] \le \operatorname{var}_\theta[T]$ for any other estimator $T =
t(X_1, \ldots, X_n)$ of $\tau(\theta)$ which satisfies $\mathcal{E}_\theta[T] = \tau(\theta)$. ////

In Subsec. 5.1 below we will derive a lower bound for the variance of 
unbiased estimators and show how it can sometimes be useful in finding an 
UMVUE. In Subsec. 5.2 we will introduce the concept of completeness and 
show how it in conjunction with sufficiency can sometimes be used to find an 
UMVUE. 

5.1 Lower Bound for Variance 

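Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$, where $\theta$ belongs to $\Theta$. Assume
that $\Theta$ is a subset of the real line. Let $T = t(X_1, \ldots, X_n)$ be an unbiased
estimator of $\tau(\theta)$. We will consider the case where $f(\cdot; \theta)$ is a probability
density function; the development for discrete density functions is analogous.
We make the following assumptions, called regularity conditions:

(i) $\dfrac{\partial}{\partial\theta} \log f(x; \theta)$ exists for all $x$ and all $\theta$.

(ii) $\dfrac{\partial}{\partial\theta} \displaystyle\int \cdots \int \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int \frac{\partial}{\partial\theta} \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n.$

(iii) $\dfrac{\partial}{\partial\theta} \displaystyle\int \cdots \int t(x_1, \ldots, x_n) \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int t(x_1, \ldots, x_n) \frac{\partial}{\partial\theta} \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n.$

(iv) $0 < \mathcal{E}_\theta\left[ \left[ \dfrac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right] < \infty$ for all $\theta$ in $\Theta$.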


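Theorem 7 Cramer-Rao inequality Under assumptions (i) to (iv)
above

$$\operatorname{var}_\theta[T] \ge \frac{[\tau'(\theta)]^2}{n\, \mathcal{E}_\theta\left[ \left[ \dfrac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]}, \tag{15}$$

where $T = t(X_1, \ldots, X_n)$ is an unbiased estimator of $\tau(\theta)$. Equality
prevails in Eq. (15) if and only if there exists a function, say $K(\theta, n)$,
such that

$$\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i; \theta) = K(\theta, n)[t(x_1, \ldots, x_n) - \tau(\theta)]. \tag{16}$$

Equation (15) is called the Cramer-Rao inequality, and the right-hand side
is called the Cramer-Rao lower bound for the variance of unbiased estimators
of $\tau(\theta)$.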


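PROOF

$$\begin{aligned}
\tau'(\theta) &= \frac{\partial}{\partial\theta} \tau(\theta) = \frac{\partial}{\partial\theta} \int \cdots \int t(x_1, \ldots, x_n) \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n \\
&= \int \cdots \int t(x_1, \ldots, x_n) \frac{\partial}{\partial\theta} \left[ \prod_{i=1}^{n} f(x_i; \theta) \right] dx_1 \cdots dx_n - \tau(\theta) \frac{\partial}{\partial\theta} \int \cdots \int \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n \\
&= \int \cdots \int t(x_1, \ldots, x_n) \frac{\partial}{\partial\theta} \left[ \prod_{i=1}^{n} f(x_i; \theta) \right] dx_1 \cdots dx_n - \tau(\theta) \int \cdots \int \frac{\partial}{\partial\theta} \left[ \prod_{i=1}^{n} f(x_i; \theta) \right] dx_1 \cdots dx_n \\
&= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)] \frac{\partial}{\partial\theta} \left[ \prod_{i=1}^{n} f(x_i; \theta) \right] dx_1 \cdots dx_n \\
&= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)] \left[ \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i; \theta) \right] \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n \\
&= \mathcal{E}_\theta\left[ [t(X_1, \ldots, X_n) - \tau(\theta)] \left[ \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(X_i; \theta) \right] \right].
\end{aligned}$$

Now by the Cauchy-Schwarz inequality

$$[\tau'(\theta)]^2 \le \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2]\; \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(X_i; \theta) \right]^2 \right],$$

or

$$\operatorname{var}_\theta[T] \ge \frac{[\tau'(\theta)]^2}{\mathcal{E}_\theta\left[ \left[ \dfrac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(X_i; \theta) \right]^2 \right]};$$

but

$$\begin{aligned}
\mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(X_i; \theta) \right]^2 \right] &= \mathcal{E}_\theta\left[ \left[ \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i; \theta) \right]^2 \right] \\
&= \sum_{i} \sum_{j} \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X_i; \theta) \right] \left[ \frac{\partial}{\partial\theta} \log f(X_j; \theta) \right] \right] \\
&= n\, \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]
\end{aligned}$$

using the independence of $X_i$ and $X_j$ and noting that

$$\mathcal{E}_\theta\left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right] = \int \left[ \frac{\partial}{\partial\theta} \log f(x; \theta) \right] f(x; \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x; \theta)\, dx = \frac{\partial}{\partial\theta} \int f(x; \theta)\, dx = \frac{\partial}{\partial\theta}(1) = 0.$$

The inequality in the Cauchy-Schwarz inequality becomes an
equality if and only if one function is proportional to the other; in our case
this requires that $\dfrac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i; \theta)$ be proportional to $t(x_1, \ldots, x_n) - \tau(\theta)$
or that there exists $K = K(\theta, n)$ such that

$$\frac{\partial}{\partial\theta} \log \left[ \prod_{i=1}^{n} f(x_i; \theta) \right] = K(\theta, n)[t(x_1, \ldots, x_n) - \tau(\theta)].$$

////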





The regularity conditions, which were stated for probability density 
functions, can be modified for discrete density functions, leaving the statement 
of the theorem unchanged. 

The theorem has two uses: First, it gives a lower bound for the variance 
of unbiased estimators. An experimenter using an unbiased estimator whose 
variance was close to the Cramer-Rao lower bound would know that he was 
using a good unbiased estimator. Second, if an unbiased estimator whose 
variance coincides with the Cramer-Rao lower bound can be found, then this 
estimator is an UMVUE. Equation (16) aids in finding an estimator whose 
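variance coincides with the Cramer-Rao lower bound. In fact, if there exists a
$T^* = t^*(X_1, \ldots, X_n)$ such that

$$\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i; \theta) = K(\theta, n)[t^*(x_1, \ldots, x_n) - \tau^*(\theta)]$$

for some functions $K(\theta, n)$ and $\tau^*(\theta)$, then $T^*$ is an UMVUE of $\tau^*(\theta)$.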


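EXAMPLE 29 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0,\infty)}(x)$. Take $\tau(\theta) = \theta$. It can be shown that the regularity
conditions are satisfied. $\tau'(\theta) = 1$; hence

$$\operatorname{var}_\theta[T] \ge \frac{1}{n\, \mathcal{E}_\theta\left[ \left[ \dfrac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]}.$$

Note that $\dfrac{\partial}{\partial\theta} \log f(x; \theta) = \dfrac{\partial}{\partial\theta}(\log \theta - \theta x) = 1/\theta - x$, and so

$$\mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right] = \mathcal{E}_\theta\left[ \left( \frac{1}{\theta} - X \right)^2 \right] = \operatorname{var}[X] = \frac{1}{\theta^2}.$$

Hence, the Cramer-Rao lower bound for the variance of unbiased
estimators of $\theta$ is given by

$$\operatorname{var}_\theta[T] \ge \frac{1}{n(1/\theta^2)} = \frac{\theta^2}{n}.$$

Similarly the Cramer-Rao lower bound for the variance of
unbiased estimators of $\tau(\theta) = 1/\theta$ is given by

$$\operatorname{var}_\theta[T] \ge \frac{[\tau'(\theta)]^2}{n(1/\theta^2)} = \frac{1/\theta^4}{n/\theta^2} = \frac{1}{n\theta^2}.$$

The left-hand side of Eq. (16) is

$$\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i; \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta}(\log \theta - \theta x_i) = \sum_{i=1}^{n} \left( \frac{1}{\theta} - x_i \right) = -n\left( \bar{x}_n - \frac{1}{\theta} \right).$$

By taking $K(\theta, n) = -n$ and utilizing the result of Eq. (16), we see that
$\bar{X}_n$ is an UMVUE of $1/\theta$ since its variance coincides with the Cramer-Rao
lower bound. ////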
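
The claim at the end of Example 29 can be checked by simulation. The sketch below is a Monte Carlo illustration under assumed settings (the seed, sample size, and number of replications are arbitrary choices, and the function names are hypothetical); it compares the simulated variance of the sample mean with the Cramer-Rao bound $1/(n\theta^2)$ for estimating $1/\theta$.

```python
import random
import statistics

def simulate_var_of_sample_mean(theta, n, reps, seed=0):
    """Monte Carlo variance of the mean of n draws from theta*exp(-theta*x)."""
    rng = random.Random(seed)
    # random.expovariate takes the rate parameter, which is theta here.
    means = [sum(rng.expovariate(theta) for _ in range(n)) / n
             for _ in range(reps)]
    return statistics.variance(means)

def cramer_rao_bound_for_inverse_theta(theta, n):
    """[tau'(theta)]^2 / (n * (1/theta^2)) with tau(theta) = 1/theta."""
    tau_prime = -1.0 / theta**2
    information_per_observation = 1.0 / theta**2
    return tau_prime**2 / (n * information_per_observation)
```

With `theta = 2.0` and `n = 10` the bound is 0.025, and the simulated variance should agree with it up to Monte Carlo error, consistent with the sample mean attaining the bound.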


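EXAMPLE 30 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = f(x; \lambda) =
e^{-\lambda} \lambda^x / x!$ for $x = 0, 1, 2, \ldots$.

$$\frac{\partial}{\partial\lambda} \log f(x; \lambda) = \frac{\partial}{\partial\lambda} \log \frac{e^{-\lambda} \lambda^x}{x!} = \frac{\partial}{\partial\lambda}(-\lambda + x \log \lambda - \log x!) = -1 + \frac{x}{\lambda}.$$

Therefore

$$\mathcal{E}_\lambda\left[ \left[ \frac{\partial}{\partial\lambda} \log f(X; \lambda) \right]^2 \right] = \mathcal{E}\left[ \left( \frac{X}{\lambda} - 1 \right)^2 \right] = \frac{1}{\lambda^2} \mathcal{E}[(X - \lambda)^2] = \frac{1}{\lambda^2} \operatorname{var}[X] = \frac{1}{\lambda^2} \lambda = \frac{1}{\lambda},$$

and so the denominator of the Cramer-Rao lower bound is $n/\lambda$. Now,
if $\tau(\lambda) = e^{-\lambda} = P[X = 0]$, then the Cramer-Rao lower bound for the
variance of unbiased estimators of $\tau(\lambda) = e^{-\lambda}$ is given by $\operatorname{var}[T] \ge
\lambda e^{-2\lambda}/n$. Note that $T = (1/n) \sum_{i=1}^{n} I_{\{0\}}(X_i)$ is an unbiased estimator of $\tau(\lambda)
= e^{-\lambda}$ since $\mathcal{E}[T] = (1/n) \sum_{i=1}^{n} \mathcal{E}[I_{\{0\}}(X_i)] = (1/n) \sum_{i=1}^{n} e^{-\lambda} = e^{-\lambda}$. $I_{\{0\}}(X_i) = 1$
if $X_i = 0$, and $I_{\{0\}}(X_i) = 0$ otherwise; so $\mathcal{E}[I_{\{0\}}(X_i)] = 1 \cdot P[X_i = 0]
+ 0 \cdot P[X_i \ne 0] = e^{-\lambda}$. $T$ is the proportion of observations in the sample
that are equal to 0. $\operatorname{var}[T] = (1/n) e^{-\lambda}(1 - e^{-\lambda})$, as compared to the
Cramer-Rao lower bound, which is $(1/n) \lambda e^{-2\lambda}$. Note that

$$(1/n) e^{-\lambda}(1 - e^{-\lambda}) > (1/n) \lambda e^{-2\lambda},$$

as it should be. An UMVUE of $\tau(\lambda) = e^{-\lambda}$ is found in Example 34.

We note that $\sum (\partial/\partial\lambda) \log f(x_i; \lambda) = \sum (-1 + x_i/\lambda) = (n/\lambda)(\bar{x} - \lambda)$;
hence, $\bar{X}$ is the UMVUE of $\lambda$ by Eq. (16). ////

In general, the Cramer-Rao lower bound is not an attainable lower bound;
that is, there often exists a lower bound for variance that is greater than the
Cramer-Rao lower bound. We will see several such examples in Subsec. 5.2
below. We will see that an UMVUE can exist whose variance does not coincide
with the Cramer-Rao lower bound.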
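
The gap between a variance and the Cramer-Rao bound can be quantified for the estimator of $e^{-\lambda}$ in Example 30. The following sketch uses only the closed-form expressions stated in that example; the function names and the "efficiency" ratio are illustrative additions rather than notation from the text.

```python
import math

def var_proportion_of_zeros(lam, n):
    """Variance of T = (1/n) sum I_{0}(X_i), a binomial proportion with p = e^{-lam}."""
    p = math.exp(-lam)
    return p * (1.0 - p) / n

def cramer_rao_bound_exp_neg_lam(lam, n):
    """[tau'(lam)]^2 / (n / lam) with tau(lam) = e^{-lam}."""
    return lam * math.exp(-2.0 * lam) / n

def efficiency(lam, n):
    """Ratio bound / variance; a value below 1 means the bound is not attained."""
    return cramer_rao_bound_exp_neg_lam(lam, n) / var_proportion_of_zeros(lam, n)
```

The ratio simplifies to $\lambda/(e^{\lambda} - 1)$, which does not depend on $n$ and is strictly less than 1 for every $\lambda > 0$.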

We conclude this subsection with several remarks, the statements of which 
are not necessarily mathematically precise. All the same, the remarks are 
important and do relate some earlier concepts to the Cramer-Rao lower bound. 

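Remark Under certain assumptions involving the existence of second
derivatives and the validity of interchanging the order of certain
differentiations and integrations,

$$\mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right] = -\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta^2} \log f(X; \theta) \right]. \qquad ////$$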

This remark is computationally useful if the first expectation is more 
difficult to obtain than the second. The proof is left as an exercise. 
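
As an illustration of the identity (a worked check added here, using the density of Example 29 and assuming the interchange conditions hold for it), the second-derivative form gives the same value that Example 29 obtained from the squared first derivative:

```latex
\log f(x;\theta) = \log\theta - \theta x, \qquad
\frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{1}{\theta} - x, \qquad
\frac{\partial^2}{\partial\theta^2}\log f(x;\theta) = -\frac{1}{\theta^2},
\\[4pt]
-\mathcal{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]
  = \frac{1}{\theta^2}
  = \operatorname{var}[X]
  = \mathcal{E}_\theta\!\left[\left(\frac{1}{\theta} - X\right)^2\right].
```

Here the second derivative does not involve $x$ at all, so its expectation is immediate, whereas the first form requires the variance of $X$.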

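Remark If the maximum-likelihood estimate of $\theta$, say $\hat{\theta} = \hat{\vartheta}(x_1, \ldots, x_n)$,
is given by a solution to the equation

$$\frac{\partial}{\partial\theta} \log L(\theta; x_1, \ldots, x_n) = \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i; \theta) = 0,$$

and if $T^* = t^*(X_1, \ldots, X_n)$ is an unbiased estimator of $\tau^*(\theta)$ whose
variance coincides with the Cramer-Rao lower bound, then $t^*(x_1, \ldots, x_n) =
\tau^*(\hat{\vartheta}(x_1, \ldots, x_n))$.

PROOF

$$0 = \left. \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i; \theta) \right|_{\theta = \hat{\theta}} = K(\hat{\theta}, n)[t^*(x_1, \ldots, x_n) - \tau^*(\hat{\theta})]$$

by Eq. (16) and the definition of $\hat{\theta}$. ////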

This remark tells us that under the conditions of the remark a maximum- 
likelihood estimator is an UMVUE! 

Remark If $T^* = t^*(X_1, \ldots, X_n)$ is an unbiased estimator of some
$\tau^*(\theta)$ whose variance coincides with the Cramer-Rao lower bound, then
$f(\cdot; \theta)$ is a member of the exponential class; and, conversely, if $f(\cdot; \theta)$
is a member of the exponential class, then there exists an unbiased
estimator, say $T^*$, of some function, say $\tau^*(\theta)$, whose variance coincides with
the Cramer-Rao lower bound. ////

We will omit the proof of this remark. It relates the Cramer-Rao lower
bound to the exponential family; in fact, it tells us that we will be able to find
an estimator whose variance coincides with the Cramer-Rao lower bound if
and only if the density from which we are sampling is a member of the
exponential class. Although the remark does not explicitly so state, the following is
true: There is essentially only one function (one function and then any linear
function of the one function) of the parameter for which there exists an unbiased
estimator whose variance coincides with the Cramer-Rao lower bound. So,
what this remark and the comments following it really tell is: The Cramer-Rao
lower bound is of limited use in finding UMVUEs. It is useful only if we
sample from a member of the one-parameter exponential family, and even then
it is useful in finding the UMVUE of only one function of the parameter.
Hence, it behooves us to search for other techniques for finding UMVUEs, and
that is what we do in the next subsection.

5.2 Sufficiency and Completeness 

In this subsection we will continue our search for UMVUEs. Our first result
will show how sufficiency aids in this search. Loosely speaking, an unbiased
estimator which is a function of sufficient statistics has smaller variance than an
unbiased estimator which is not based on sufficient statistics. In fact, let
$f(\cdot; \theta)$ be the density from which we can sample, and suppose that we want to
estimate $\tau(\theta)$. Let us assume that $T = t(X_1, \ldots, X_n)$ is an unbiased estimator
of $\tau(\theta)$ and that $S = s(X_1, \ldots, X_n)$ is a sufficient statistic. It can be shown that
another unbiased estimator, denoted by $T'$, can be derived from $T$ such that
(i) $T'$ is a function of the sufficient statistic $S$ and (ii) $T'$ is an unbiased estimator
of $\tau(\theta)$ with variance less than or equal to the variance of $T$. Therefore, in our
search for UMVUEs we need to consider only unbiased estimators that are
functions of sufficient statistics. We shall formalize these ideas in the following
theorem.

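Theorem 8 Rao-Blackwell Let $X_1, \ldots, X_n$ be a random sample from
the density $f(\cdot; \theta)$, and let $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_k = s_k(X_1, \ldots, X_n)$
be a set of jointly sufficient statistics. Let the statistic $T = t(X_1, \ldots, X_n)$
be an unbiased estimator of $\tau(\theta)$. Define $T'$ by $T' = \mathcal{E}[T \mid S_1, \ldots, S_k]$.
Then,

(i) $T'$ is a statistic, and it is a function of the sufficient statistics
$S_1, \ldots, S_k$. Write $T' = t'(S_1, \ldots, S_k)$.

(ii) $\mathcal{E}_\theta[T'] = \tau(\theta)$; that is, $T'$ is an unbiased estimator of $\tau(\theta)$.

(iii) $\operatorname{var}_\theta[T'] \le \operatorname{var}_\theta[T]$ for every $\theta$, and $\operatorname{var}_\theta[T'] < \operatorname{var}_\theta[T]$ for
some $\theta$ unless $T$ is equal to $T'$ with probability 1.

PROOF (i) $S_1, \ldots, S_k$ are sufficient statistics; so the conditional
distribution of any statistic, in particular the statistic $T$, given $S_1, \ldots, S_k$
is independent of $\theta$; hence $T' = \mathcal{E}[T \mid S_1, \ldots, S_k]$ is independent of $\theta$, and
so $T'$ is a statistic which is obviously a function of $S_1, \ldots, S_k$.
(ii) $\mathcal{E}_\theta[T'] = \mathcal{E}_\theta[\mathcal{E}[T \mid S_1, \ldots, S_k]] = \mathcal{E}_\theta[T] = \tau(\theta)$ [using Eq. (26) of
Chap. IV]. (iii) We can write

$$\begin{aligned}
\operatorname{var}_\theta[T] &= \mathcal{E}_\theta[(T - \mathcal{E}_\theta[T'])^2] = \mathcal{E}_\theta[(T - T' + T' - \mathcal{E}_\theta[T'])^2] \\
&= \mathcal{E}_\theta[(T - T')^2] + 2\mathcal{E}_\theta[(T - T')(T' - \mathcal{E}_\theta[T'])] + \operatorname{var}_\theta[T'].
\end{aligned}$$

But

$$\mathcal{E}_\theta[(T - T')(T' - \mathcal{E}_\theta[T'])] = \mathcal{E}_\theta[\mathcal{E}[(T - T')(T' - \mathcal{E}_\theta[T']) \mid S_1, \ldots, S_k]],$$

and

$$\begin{aligned}
\mathcal{E}[(T - T')(T' - \mathcal{E}_\theta[T']) \mid S_1 = s_1; \ldots; S_k = s_k]
&= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\}\, \mathcal{E}[(T - T') \mid S_1 = s_1; \ldots; S_k = s_k] \\
&= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\} (\mathcal{E}[T \mid S_1 = s_1; \ldots; S_k = s_k] - \mathcal{E}[T' \mid S_1 = s_1; \ldots; S_k = s_k]) \\
&= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\} [t'(s_1, \ldots, s_k) - t'(s_1, \ldots, s_k)] \\
&= 0,
\end{aligned}$$

and therefore

$$\operatorname{var}_\theta[T] = \mathcal{E}_\theta[(T - T')^2] + \operatorname{var}_\theta[T'] \ge \operatorname{var}_\theta[T'].$$

Note that $\operatorname{var}_\theta[T] > \operatorname{var}_\theta[T']$ unless $T$ equals $T'$ with probability 1. ////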

For many applications (particularly where the density involved has only
one unknown parameter) there will exist a single sufficient statistic, say
$S = s(X_1, \ldots, X_n)$, which would then be used in place of the jointly sufficient
set of statistics $S_1, \ldots, S_k$. What the theorem says is that, given an unbiased
estimator, another unbiased estimator that is a function of sufficient statistics
can be derived and it will not have larger variance. To find the derived statistic,
the calculation of a conditional expectation, which may or may not be easy, is
required.


EXAMPLE 31 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli density
$f(x; \theta) = \theta^x (1 - \theta)^{1-x}$ for $x = 0$ or 1. $X_1$ is an unbiased estimator of
$\tau(\theta) = \theta$. We use $X_1$ as $T = t(X_1, \ldots, X_n)$ in the above theorem. $\sum X_i$
is a sufficient statistic; so we use $S = \sum X_i$ as our set (of one element) of
sufficient statistics. According to the above theorem $T' = \mathcal{E}[T \mid S] =
\mathcal{E}[X_1 \mid \sum X_i]$ is an unbiased estimator of $\theta$ with no larger variance than
$T = X_1$. Let us evaluate $\mathcal{E}[T \mid S]$. We first find the conditional
distribution of $X_1$ given $\sum X_i = s$. $X_1$ takes on at most the two values 0 and 1.

$$\begin{aligned}
P\left[ X_1 = 0 \,\Big|\, \sum_{i=1}^{n} X_i = s \right]
&= \frac{P\left[ X_1 = 0;\ \sum_{i=1}^{n} X_i = s \right]}{P\left[ \sum_{i=1}^{n} X_i = s \right]}
= \frac{P\left[ X_1 = 0;\ \sum_{i=2}^{n} X_i = s \right]}{P\left[ \sum_{i=1}^{n} X_i = s \right]} \\
&= \frac{P[X_1 = 0]\, P\left[ \sum_{i=2}^{n} X_i = s \right]}{P\left[ \sum_{i=1}^{n} X_i = s \right]}
= \frac{(1 - \theta) \binom{n-1}{s} \theta^s (1 - \theta)^{n-1-s}}{\binom{n}{s} \theta^s (1 - \theta)^{n-s}}
= \frac{n - s}{n},
\end{aligned}$$

and consequently $P[X_1 = 1 \mid \sum X_i = s] = 1 - (n - s)/n = s/n$.

We note in passing that the conditional distribution of $X_1$ given $\sum X_i = s$
is independent of $\theta$, as it should be. Also, we could have derived the
conditional distribution with much less effort by asking: Given that you
have observed $s$ successes in $n$ trials, what is the probability that the
first trial resulted in a success? This probability is $s/n$. (See Example 28
in Chap. I.)

$$\mathcal{E}\left[ X_1 \,\Big|\, \sum_{i=1}^{n} X_i = s \right] = 0 \cdot \frac{n - s}{n} + 1 \cdot \frac{s}{n} = \frac{s}{n};$$

hence,

$$T' = \frac{\sum_{i=1}^{n} X_i}{n}.$$

The variance of $X_1$ is $\theta(1 - \theta)$, and the variance of $T'$ is $\theta(1 - \theta)/n$; so
for $n > 1$ the variance of $T'$ is actually smaller than the variance of
$T = X_1$. ////
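
Example 31 is small enough to verify by exhaustive enumeration. The sketch below, with hypothetical function names and exact rational arithmetic, computes $\mathcal{E}[X_1 \mid \sum X_i = s]$ by listing all 0/1 sequences and then compares the exact variances of $T = X_1$ and $T' = \sum X_i / n$.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_of_first(n, s):
    """E[X_1 | sum X_i = s] for Bernoulli trials, by enumerating outcomes.

    Given the sum, every 0/1 sequence with s ones is equally likely
    (the dependence on theta cancels), so the conditional mean is a plain average.
    """
    outcomes = [xs for xs in product((0, 1), repeat=n) if sum(xs) == s]
    return Fraction(sum(xs[0] for xs in outcomes), len(outcomes))

def variance_of_estimator(estimator, n, theta):
    """Exact variance of estimator(xs) under Bernoulli(theta) sampling."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for xs in product((0, 1), repeat=n):
        prob = theta ** sum(xs) * (1 - theta) ** (n - sum(xs))
        value = Fraction(estimator(xs))
        mean += prob * value
        second_moment += prob * value * value
    return second_moment - mean * mean
```

For instance, `conditional_mean_of_first(5, 2)` returns `Fraction(2, 5)`, matching $s/n$, and with `theta = Fraction(3, 10)` and `n = 4` the two variances come out as $\theta(1-\theta)$ and $\theta(1-\theta)/4$.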


Before leaving Theorem 8, two comments are appropriate: First, if the
unbiased estimator $T$ is already a function of only $S_1, \ldots, S_k$, then the derived
statistic $T'$ will be identical to $T$, and hence no improvement in variance can be
expected. Second, although the set of jointly sufficient statistics is an arbitrary
set, in practice one would naturally use a minimal set of jointly sufficient statistics
if such were available.

Theorem 8 tells us how to improve on an unbiased estimator by
conditioning on sufficient statistics. For some estimation problems this unbiased
estimator, obtained by conditioning on sufficient statistics, will be an UMVUE.
To aid in identifying those estimation problems for which a derived estimator
is an UMVUE, the concept of completeness of a family of densities is useful.

Definition 22 Complete family of densities Let $X_1, \ldots, X_n$ denote a
random sample from the density $f(\cdot; \theta)$ with parameter space $\Theta$, and let
$T = t(X_1, \ldots, X_n)$ be a statistic. The family of densities of $T$ is defined
to be complete if and only if $\mathcal{E}_\theta[z(T)] = 0$ for all $\theta \in \Theta$ implies that
$P_\theta[z(T) = 0] = 1$ for all $\theta \in \Theta$, where $z(T)$ is a statistic. Also, the statistic
$T$ is said to be complete if and only if its family of densities is complete. ////


Another way of stating that a statistic $T$ is complete is the following:
$T$ is complete if and only if the only unbiased estimator of 0 that is a function of
$T$ is the statistic that is identically 0 with probability 1.


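EXAMPLE 32 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli
density. The statistic $T = X_1 - X_2$ is not complete since $\mathcal{E}_\theta[X_1 - X_2] = 0$
and $X_1 - X_2$ is not 0 with probability 1. Consider the statistic $T = \sum_{i=1}^{n} X_i$.
Let $z(T)$ be any statistic that is a function of $T$ for which $\mathcal{E}_\theta[z(T)] = 0$
for all $\theta \in \Theta$, that is, for $0 < \theta < 1$. To argue that $T$ is complete, we must
show that $z(t) = 0$ for $t = 0, 1, \ldots, n$. Now

$$\mathcal{E}_\theta[z(T)] = \sum_{t=0}^{n} z(t) \binom{n}{t} \theta^t (1 - \theta)^{n-t} = (1 - \theta)^n \sum_{t=0}^{n} z(t) \binom{n}{t} \left( \frac{\theta}{1 - \theta} \right)^t;$$

hence, $\mathcal{E}_\theta[z(T)] = 0$ for all $0 < \theta < 1$ implies that

$$\sum_{t=0}^{n} z(t) \binom{n}{t} \left( \frac{\theta}{1 - \theta} \right)^t = 0$$

or

$$\sum_{t=0}^{n} z(t) \binom{n}{t} \alpha^t = 0$$

for all $\alpha$, where $\alpha = \theta/(1 - \theta)$. Now in order for a polynomial in $\alpha$ to be
identically 0, each coefficient of $\alpha^t$, $t = 0, \ldots, n$, must be 0; that is,
$z(t) \binom{n}{t} = 0$ for $t = 0, \ldots, n$, but $\binom{n}{t} \ne 0$; so $z(t) = 0$ for $t = 0, \ldots, n$. ////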


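EXAMPLE 33 Let $X_1, \ldots, X_n$ be a random sample from the uniform
distribution over the interval $(0, \theta)$, where $\Theta = \{\theta : \theta > 0\}$. Show that the
statistic $Y_n$ is complete. We must show that if $\mathcal{E}_\theta[z(Y_n)] = 0$ for all $\theta > 0$,
then $P_\theta[z(Y_n) = 0] = 1$ for all $\theta > 0$.

$$\mathcal{E}_\theta[z(Y_n)] = \int z(y) f_{Y_n}(y)\, dy = \int_0^\theta z(y)\, \theta^{-n} n y^{n-1}\, dy,$$

and $\mathcal{E}_\theta[z(Y_n)] = 0$ for all $\theta > 0$ when and only when

$$\frac{n}{\theta^n} \int_0^\theta z(y) y^{n-1}\, dy = 0 \quad \text{for all } \theta > 0$$

or

$$\int_0^\theta z(y) y^{n-1}\, dy = 0 \quad \text{for all } \theta > 0.$$

Differentiating both sides of this last identity with respect to $\theta$ produces
$z(\theta) \theta^{n-1} = 0$, which in turn implies that $z(\theta) = 0$ for $\theta > 0$. ////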

In general, demonstrating completeness can require tricky analysis. The
two above examples are exceptions. We state now, without proof, a theorem
that gives us completeness of a statistic. It will be our main tool for arguing
completeness.

Theorem 9 Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$, $\theta \in \Theta$, where
$\Theta$ is an interval (possibly infinite). If $f(x; \theta) = a(\theta) b(x) \exp[c(\theta) d(x)]$,
that is, $f(\cdot; \theta)$ is a member of the one-parameter exponential family,
then $\sum_{i=1}^{n} d(X_i)$ is a complete minimal sufficient statistic. ////

Theorem 9 shows once again the importance of the exponential family or 
exponential class. We are finally adequately prepared to state the theorem that 
is useful in finding UMVUEs. 

Theorem 10 Lehmann-Scheffe Let $X_1, \ldots, X_n$ be a random sample
from a density $f(\cdot; \theta)$. If $S = s(X_1, \ldots, X_n)$ is a complete sufficient
statistic and if $T^* = t^*(S)$, a function of $S$, is an unbiased estimator of
$\tau(\theta)$, then $T^*$ is an UMVUE of $\tau(\theta)$.

PROOF Let $T'$ be any unbiased estimator of $\tau(\theta)$ which is a function
of $S$; that is, $T' = t'(S)$. Then $\mathcal{E}_\theta[T^* - T'] = 0$ for all $\theta \in \Theta$, and
$T^* - T'$ is a function of $S$; so by completeness of $S$, $P_\theta[t^*(S) = t'(S)] = 1$
for all $\theta \in \Theta$. Hence there is only one unbiased estimator of $\tau(\theta)$ that
is a function of $S$. Now let $T$ be any unbiased estimator of $\tau(\theta)$. $T^*$
must be equal to $\mathcal{E}[T \mid S]$ since $\mathcal{E}[T \mid S]$ is an unbiased estimator of $\tau(\theta)$
depending on $S$. By Theorem 8, $\operatorname{var}_\theta[T^*] \le \operatorname{var}_\theta[T]$ for all $\theta \in \Theta$; so
$T^*$ is an UMVUE. ////

Let us review what this important theorem says: First, if a complete
sufficient statistic $S$ exists and if there is an unbiased estimator for $\tau(\theta)$, then
there is an UMVUE for $\tau(\theta)$; second, the UMVUE is the unique unbiased
estimator of $\tau(\theta)$ which is a function of $S$.

To actually find that unbiased estimator of $\tau(\theta)$ which is a function of $S$,
we have several ways of proceeding. First, simply guess the correct form of the
function of $S$ that defines the desired estimator. Second, guess or find any
unbiased estimator of $\tau(\theta)$, and then calculate the conditional expectation of
the unbiased estimator given the sufficient statistic. Third, solve for $t^*(\cdot)$
in the equation $\mathcal{E}_\theta[t^*(S)] = \tau(\theta)$. Such an equation becomes the integral
equation $\int t^*(s) f_S(s)\, ds = \tau(\theta)$ in the case of a continuous random variable $S$
and becomes the summation $\sum t^*(s) f_S(s) = \tau(\theta)$ for $S$ a discrete random variable.
We will employ two of these methods in the following examples.


EXAMPLE 34 Let $X_1, \ldots, X_n$ be a random sample from the Poisson density

$$f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, \ldots.$$

We saw in Example 25 that $f(x; \lambda)$ belongs to the exponential family with
$d(x) = x$. By Theorem 9, the statistic $\sum X_i$ is complete and sufficient.
To find the UMVUE of $\lambda$ itself, it suffices to guess a function of $\sum X_i$
whose expectation is $\lambda$. Noting that $\lambda$ is the population mean, $(1/n) \sum X_i$
is the obvious choice; so $(1/n) \sum X_i$ is the UMVUE of $\lambda$.

Consider now estimating $\tau(\lambda) = e^{-\lambda} = P[X_i = 0]$. (Recall
Example 30.) Let us derive the UMVUE of $e^{-\lambda}$ by calculating the
conditional expectation of some unbiased estimator given the sufficient statistic.
Any unbiased estimator will do as the preliminary estimator whose
conditional expectation needs to be calculated; so we may as well choose
one that would make the calculations easy. $I_{\{0\}}(X_1)$ is an unbiased estimator
of $e^{-\lambda}$ and is relatively simple since it can assume only the two values 0
and 1. By Theorem 10, $\mathcal{E}[I_{\{0\}}(X_1) \mid \sum X_i]$ is the UMVUE of $e^{-\lambda}$. To find
the desired conditional expectation, we first find the conditional
distribution of $X_1$ given $\sum X_i$.

$$\begin{aligned}
P\left[ X_1 = 0 \,\Big|\, \sum_{i=1}^{n} X_i = s \right]
&= \frac{P\left[ X_1 = 0;\ \sum_{i=1}^{n} X_i = s \right]}{P\left[ \sum_{i=1}^{n} X_i = s \right]}
= \frac{P[X_1 = 0]\, P\left[ \sum_{i=2}^{n} X_i = s \right]}{P\left[ \sum_{i=1}^{n} X_i = s \right]} \\
&= \frac{e^{-\lambda}\, e^{-(n-1)\lambda} [(n - 1)\lambda]^s / s!}{e^{-n\lambda} (n\lambda)^s / s!}
= \left( \frac{n - 1}{n} \right)^s \quad \text{for } n > 1.
\end{aligned}$$

Therefore,

$$\mathcal{E}\left[ I_{\{0\}}(X_1) \,\Big|\, \sum X_i = s \right] = P\left[ X_1 = 0 \,\Big|\, \sum X_i = s \right] = \left( \frac{n - 1}{n} \right)^s;$$

hence

$$\left( \frac{n - 1}{n} \right)^{\sum X_i}$$

is the UMVUE of $e^{-\lambda}$ for $n > 1$. For $n = 1$, $I_{\{0\}}(X_1)$ is an unbiased
estimator which is a function of the complete sufficient statistic $X_1$, and
hence $I_{\{0\}}(X_1)$ itself is the UMVUE of $e^{-\lambda}$. The reader may want to derive
the mean and variance of

$$\left( \frac{n - 1}{n} \right)^{\sum X_i}$$

and compare them with the mean and variance of the estimator
$(1/n) \sum_{i=1}^{n} I_{\{0\}}(X_i)$ given in Example 30. ////
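
The comparison suggested at the end of Example 34 can be carried out numerically. The sketch below relies on the standard fact that $\sum X_i$ is Poisson with mean $n\lambda$, and it truncates the infinite sums at an assumed cutoff (`terms`); the function names are illustrative.

```python
import math

def poisson_pmf(k, mu):
    """P[S = k] for S ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def moments_of_umvue(lam, n, terms=400):
    """Mean and variance of ((n-1)/n)^S, where S = sum X_i ~ Poisson(n*lam).

    The sums are truncated at `terms`, which is adequate only when
    n*lam is small relative to `terms`.
    """
    a = (n - 1) / n
    mu = n * lam
    mean = sum(a**k * poisson_pmf(k, mu) for k in range(terms))
    second = sum(a**(2 * k) * poisson_pmf(k, mu) for k in range(terms))
    return mean, second - mean**2

def variance_of_zero_proportion(lam, n):
    """Variance of (1/n) sum I_{0}(X_i) from Example 30."""
    p = math.exp(-lam)
    return p * (1 - p) / n
```

For `lam = 1.5` and `n = 8`, the mean returned by `moments_of_umvue` agrees with $e^{-1.5}$, and the variance lies strictly between the Cramer-Rao bound $\lambda e^{-2\lambda}/n$ and the variance of the proportion of zeros. So the UMVUE improves on the estimator of Example 30 without attaining the bound.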


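EXAMPLE 35 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0,\infty)}(x)$. Our object is to find the UMVUE of each of the following
functions of the parameter $\theta$: $\theta$, $1/\theta$, and $e^{-K\theta} = P[X > K]$ for given $K$.
Since $\theta e^{-\theta x} I_{(0,\infty)}(x)$ is a member of the exponential class (see Example 24),
the statistic $S = \sum_{i=1}^{n} X_i$ is complete and sufficient.

$\bar{X}_n = (1/n) \sum_{i=1}^{n} X_i$, which is a function of the complete sufficient
statistic $S = \sum_{i=1}^{n} X_i$, is an unbiased estimator of $1/\theta$; hence by Theorem 10,
$\bar{X}_n$ is the UMVUE of $1/\theta$.

To find the UMVUE of $\theta$, one might suspect that the estimator is
of the form $c / \sum_{i=1}^{n} X_i$, where $c$ is a constant which may depend on $n$. Now

$$\begin{aligned}
\mathcal{E}_\theta\left[ \frac{c}{\sum X_i} \right] &= c\, \mathcal{E}_\theta\left[ \frac{1}{S} \right] = c \int_0^\infty \frac{1}{s} \cdot \frac{1}{\Gamma(n)} \theta^n s^{n-1} e^{-\theta s}\, ds \\
&= c\, \frac{1}{\Gamma(n)} \int_0^\infty \theta^n s^{n-2} e^{-\theta s}\, ds = \frac{c\theta}{\Gamma(n)} \int_0^\infty u^{n-2} e^{-u}\, du \\
&= \frac{c\theta}{\Gamma(n)} \cdot \Gamma(n - 1) = \frac{c\theta}{n - 1}
\end{aligned}$$

for $n > 1$. So $\mathcal{E}_\theta[c / \sum X_i] = \theta$ when $c = n - 1$; hence $(n - 1)/\sum X_i$ is the
UMVUE of $\theta$ for $n > 1$. The variance of $(n - 1)/\sum X_i$ is given by
$\theta^2/(n - 2)$ for $n > 2$.

Although one might be able to guess which function of $S = \sum X_i$
is an unbiased estimator for $e^{-K\theta}$, let us derive the desired estimator by
starting with the following simple unbiased estimator of $e^{-K\theta}$: $I_{(K,\infty)}(X_1)$.
Note that $\mathcal{E}_\theta[I_{(K,\infty)}(X_1)] = 0 \cdot P[X_1 \le K] + 1 \cdot P[X_1 > K] = P[X_1 > K] =
e^{-K\theta}$; so $I_{(K,\infty)}(X_1)$ is indeed an unbiased estimator of $e^{-K\theta}$, and therefore
by Theorems 8 and 10 $\mathcal{E}_\theta[I_{(K,\infty)}(X_1) \mid S]$ is the UMVUE of $e^{-K\theta}$. Now,
$\mathcal{E}_\theta[I_{(K,\infty)}(X_1) \mid S = s] = P[I_{(K,\infty)}(X_1) = 1 \mid S = s] = P[X_1 > K \mid S = s]$. In
order to obtain $P[X_1 > K \mid S = s]$, we will first find the conditional
distribution of $X_1$ given $S = s$.

$$\begin{aligned}
f_{X_1 \mid S = s}(x_1 \mid s)\, \Delta x_1
&= \frac{f_{X_1, S}(x_1, s)\, \Delta x_1\, \Delta s}{f_S(s)\, \Delta s} \\
&\approx \frac{P\left[ x_1 < X_1 < x_1 + \Delta x_1;\ s < \sum_{i=1}^{n} X_i < s + \Delta s \right]}{[1/\Gamma(n)]\, \theta^n s^{n-1} e^{-\theta s}\, \Delta s} \\
&\approx \frac{P\left[ x_1 < X_1 < x_1 + \Delta x_1;\ s - x_1 < \sum_{i=2}^{n} X_i < s - x_1 + \Delta s \right]}{[1/\Gamma(n)]\, \theta^n s^{n-1} e^{-\theta s}\, \Delta s} \\
&= \frac{P[x_1 < X_1 < x_1 + \Delta x_1]\, P\left[ s - x_1 < \sum_{i=2}^{n} X_i < s - x_1 + \Delta s \right]}{[1/\Gamma(n)]\, \theta^n s^{n-1} e^{-\theta s}\, \Delta s} \\
&\approx \frac{\theta e^{-\theta x_1}\, [1/\Gamma(n - 1)]\, \theta^{n-1} (s - x_1)^{n-2} e^{-\theta(s - x_1)}\, \Delta x_1\, \Delta s}{[1/\Gamma(n)]\, \theta^n s^{n-1} e^{-\theta s}\, \Delta s} \\
&= \frac{\Gamma(n)}{\Gamma(n - 1)} \cdot \frac{(s - x_1)^{n-2}}{s^{n-1}}\, \Delta x_1
\end{aligned}$$

for $0 < x_1 < s$ and $n > 1$.

$$\begin{aligned}
\mathcal{E}_\theta[I_{(K,\infty)}(X_1) \mid S = s] &= P[X_1 > K \mid S = s] = \int_K^s f_{X_1 \mid S = s}(x_1 \mid s)\, dx_1 \\
&= \int_K^s \frac{\Gamma(n)}{\Gamma(n - 1)} \cdot \frac{(s - x_1)^{n-2}}{s^{n-1}}\, dx_1
= \frac{n - 1}{s^{n-1}} \int_K^s (s - x_1)^{n-2}\, dx_1 \\
&= \frac{n - 1}{s^{n-1}} \int_{s-K}^{0} y^{n-2} (-dy)
= \frac{n - 1}{s^{n-1}} \cdot \left. \frac{y^{n-1}}{n - 1} \right|_0^{s-K}
= \left( \frac{s - K}{s} \right)^{n-1}
\end{aligned}$$

for $s > K$ and $n > 1$, where the substitution $y = s - x_1$ was made. Hence,

$$\left( \frac{\sum X_i - K}{\sum X_i} \right)^{n-1} I_{(K,\infty)}\left( \sum X_i \right)$$

is the UMVUE for $e^{-K\theta}$ for $n > 1$. (Actually the estimator is applicable
for $n = 1$ as well.) It may be of interest and would serve as a check to
verify directly that

$$\left( \frac{\sum X_i - K}{\sum X_i} \right)^{n-1} I_{(K,\infty)}\left( \sum X_i \right)$$

is unbiased.

$$\begin{aligned}
\mathcal{E}_\theta\left[ \left( \frac{S - K}{S} \right)^{n-1} I_{(K,\infty)}(S) \right]
&= \int_K^\infty \left( \frac{s - K}{s} \right)^{n-1} \frac{1}{\Gamma(n)} \theta^n s^{n-1} e^{-\theta s}\, ds
= \frac{1}{\Gamma(n)} \int_K^\infty \theta^n (s - K)^{n-1} e^{-\theta s}\, ds \\
&= e^{-K\theta}\, \frac{1}{\Gamma(n)} \int_0^\infty \theta^n u^{n-1} e^{-\theta u}\, du = e^{-K\theta},
\end{aligned}$$

where the substitution $u = s - K$ was made. ////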
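
The two less obvious estimators of Example 35 can also be checked by simulation. The following Monte Carlo sketch draws $S = \sum X_i$ directly from its gamma distribution (shape $n$, scale $1/\theta$); the seed, replication count, and function name are arbitrary assumptions made for illustration.

```python
import math
import random

def simulate_umvues(theta, n, K, reps=200_000, seed=1):
    """Monte Carlo means of the UMVUEs of theta and of e^{-K theta} (Example 35).

    S is drawn with random.gammavariate(shape, scale), using shape n and
    scale 1/theta, which is the distribution of a sum of n exponentials.
    """
    rng = random.Random(seed)
    total_theta_hat = 0.0
    total_tail_hat = 0.0
    for _ in range(reps):
        s = rng.gammavariate(n, 1.0 / theta)
        total_theta_hat += (n - 1) / s                   # estimates theta
        if s > K:                                        # indicator I_(K, inf)(S)
            total_tail_hat += ((s - K) / s) ** (n - 1)   # estimates e^{-K theta}
    return total_theta_hat / reps, total_tail_hat / reps
```

With `theta = 2.0`, `n = 5`, and `K = 0.4`, the two averages should be close to 2 and to $e^{-0.8}$ respectively, up to simulation error.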

In closing this section on unbiased estimation, we make several remarks. 

Remark For some functions of the parameter there is no unbiased
estimator. For example, in a sample of size 1 from a binomial density
there is no unbiased estimator for $1/\theta$. Suppose there were; let $T = t(X)$
denote it. Then $\mathcal{E}_\theta[T] = \sum_{x=0}^{n} t(x) \binom{n}{x} \theta^x (1 - \theta)^{n-x} = 1/\theta$, which says that
an $n$th-degree polynomial in $\theta$ is identical to $1/\theta$, which cannot be. ////

Remark We mentioned in Subsec. 5.1 that the Cramer-Rao lower bound
is not necessarily the best lower bound. For example, the Cramer-Rao
lower bound for the variance of unbiased estimators of $\theta$ in sampling
from the negative exponential distribution is given by $\theta^2/n$ (see Example 29),
and the variance of the UMVUE of $\theta$ is given by $\theta^2/(n - 2)$ (see
Example 35). $\theta^2/(n - 2)$ is necessarily the best lower bound. ////

Remark For some estimation problems there is an unbiased estimator
but no UMVUE. Consider the following example. ////



EXAMPLE 36 Let $X_1, \ldots, X_n$ be a random sample from the uniform density
over the interval $(\theta, \theta + 1]$. We want to estimate $\theta$. $\bar{X}_n - \frac{1}{2}$ and
$(Y_1 + Y_n)/2 - \frac{1}{2}$ are unbiased estimators of $\theta$, yet there is no UMVUE
of $\theta$. For fixed $0 \le p < 1$, consider the estimator $g(X_1 - p) + p$, where
the function $g(y)$ is defined to be the greatest integer less than $y$. Now

$$\mathcal{E}[g(X_1 - p) + p] = \int_\theta^{\theta+1} g(x_1 - p)\, dx_1 + p = \int_{\theta-p}^{\theta+1-p} g(y)\, dy + p.$$

For fixed $\theta$ and $p$, there exists an integer, say $N = N(\theta, p)$, satisfying
$\theta - p < N \le \theta + 1 - p$. Hence,

$$\mathcal{E}[g(X_1 - p) + p] = \int_{\theta-p}^{\theta+1-p} g(y)\, dy + p = \int_{\theta-p}^{N} (N - 1)\, dy + \int_{N}^{\theta+1-p} N\, dy + p = \theta.$$

So $g(X_1 - p) + p$ is an unbiased estimator of $\theta$. Moreover, if $\theta + 1 - p$
is an integer, say $J$, then $g(x_1 - p) = J - 1$ for all $x_1$ satisfying
$\theta - p < x_1 - p \le \theta + 1 - p$; so $g(x_1 - p) + p = J - 1 + p = \theta + 1 - p
- 1 + p = \theta$ for all $\theta < x_1 \le \theta + 1$; that is, $g(X_1 - p) + p$ estimates $\theta$ with
no error, and hence has zero variance for $\theta + 1 - p$ equal to any integer.
So, we have an estimator, namely $g(X_1 - p) + p$, which has zero variance
for $\theta$ = any integer $- 1 + p$. But $0 \le p < 1$ is arbitrary; so for any fixed
$\theta$, say $\theta_0$, we can find an unbiased estimator of $\theta$ which has zero variance
at $\theta_0$. Hence, in order for an estimator to be the UMVUE of $\theta$, it must
have zero variance for all $\theta$; that is, it must always estimate $\theta$ without
error. Clearly, no such estimator exists. {The reader may wish to show
that $\operatorname{var}[g(X_1 - p) + p] = [N - (\theta - p)][(\theta + 1 - p) - N]$, where $N =
N(\theta, p)$ is an integer satisfying $\theta - p < N \le \theta + 1 - p$.} ////
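
The behavior of the estimator $g(X_1 - p) + p$ in Example 36 is easy to see on a grid of possible observations. The sketch below uses exact rational arithmetic so that the "greatest integer less than $y$" is not disturbed by rounding; the function names and the grid size are illustrative assumptions.

```python
import math
from fractions import Fraction

def greatest_integer_less_than(y):
    """g(y): the greatest integer strictly less than y."""
    return math.ceil(y) - 1

def estimator(x1, p):
    """The estimator g(x1 - p) + p of Example 36, for 0 <= p < 1."""
    return greatest_integer_less_than(x1 - p) + p

def errors_on_grid(theta, p, steps=1000):
    """Estimation errors for x1 on an evenly spaced grid in (theta, theta + 1]."""
    xs = [theta + Fraction(k, steps) for k in range(1, steps + 1)]
    return [estimator(x, p) - theta for x in xs]
```

With `theta = Fraction(23, 10)` and `p = Fraction(3, 10)`, so that $\theta + 1 - p = 3$ is an integer, every error is exactly zero. With `theta = Fraction(5, 2)` and the same `p`, the errors take only the two values $-1/5$ and $4/5$, and they balance out over the grid, in line with unbiasedness.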


Remark It is sometimes possible to find an UMVUE even when a
minimal sufficient statistic is not complete. See Prob. 11, p. 313, in
Rao [17]. ////


6 LOCATION OR SCALE INVARIANCE 

In the last section we employed the property of unbiasedness as a means of
restricting the class of estimators with the hope of finding an estimator having
minimum mean-squared error within the restricted class. In this section we will
indicate how an alternative property, the property of invariance, can be used to
restrict the class of estimators. Our discussion will be limited to only two types
of invariance, namely, location invariance and scale invariance; a fuller
discussion, which is beyond the scope of this book, can be found in Refs. [12] and [19].

6.1 Location Invariance

If the observations $X_1, \ldots, X_n$ represented measurements of some sort and the
parameter being estimated was also measured in the same units, one might
reasonably require that an estimator $t(\cdot, \ldots, \cdot)$ satisfy the property
$t(x_1 + c, x_2 + c, \ldots, x_n + c) = t(x_1, \ldots, x_n) + c$ for every constant $c$. The
idea is that if a constant $c$ is added to each of the measurements $x_1, \ldots, x_n$,
then the estimator evaluated at the adjusted measurements $x_1 + c, \ldots, x_n + c$
ought to adjust the estimated value $t(x_1, \ldots, x_n)$ by adding the same constant
to it. For example, suppose that it is desired to estimate the average weight
of a group of pigs when the only method available for weighing is for a person
to stand on a scale holding a pig; so both the pig and person are weighed. If
one person were to hold the pigs, the measurements (weights) $x_1 + c, \ldots,
x_n + c$ would be obtained, where $x_i$ is the weight of the $i$th pig and $c$ is the person's
weight. If, on the other hand, someone else were to hold the pigs, the
measurements $x_1 + c', \ldots, x_n + c'$ would be obtained, where $c'$ is the other person's
weight. It seems reasonable that the estimate of the average weight of the group
of pigs obtained should not depend on which person held the pigs; that is, the
estimate should not vary with $c$, the weight of the pig holder. We define a
location-invariant estimator accordingly.

Definition 23 Location invariant An estimator T = t(X x , ..., X„) is 
defined to be location-invariant if and only if /(x t + c, ..., x„ + c) = 
/(x 1; ..., x„) + c for all values Xj, ..., x„ and all c. //// 

A number of the estimators that we have encountered are location-invariant,
for example, $\bar{X}_n$ and $(Y_1 + Y_n)/2$, as the following shows:

$$t(x_1 + c, \ldots, x_n + c) = \frac{\sum (x_i + c)}{n} = \frac{\sum x_i}{n} + c = t(x_1, \ldots, x_n) + c$$

for $t(x_1, \ldots, x_n) = \bar{x}_n$; and

$$
\begin{aligned}
t(x_1 + c, \ldots, x_n + c)
&= \frac{\min[x_1 + c, \ldots, x_n + c] + \max[x_1 + c, \ldots, x_n + c]}{2} \\
&= \frac{\min[x_1, \ldots, x_n] + c + \max[x_1, \ldots, x_n] + c}{2} \\
&= \frac{\min[x_1, \ldots, x_n] + \max[x_1, \ldots, x_n]}{2} + c \\
&= t(x_1, \ldots, x_n) + c
\end{aligned}
$$

for $t(x_1, \ldots, x_n) = (y_1 + y_n)/2$. On the other hand, quite a number of
estimators are not location-invariant; for example, $S^2$ and $Y_n - Y_1$, as the
following shows: Take $T = t(X_1, \ldots, X_n) = S^2 = \sum (X_i - \bar{X}_n)^2/(n - 1)$; then
$t(x_1 + c, \ldots, x_n + c) = \sum_i [x_i + c - \sum_j (x_j + c)/n]^2/(n - 1) = t(x_1, \ldots, x_n)$,
instead of $t(x_1, \ldots, x_n) + c$. Now take $T = t(X_1, \ldots, X_n) = Y_n - Y_1$; then
$t(x_1 + c, \ldots, x_n + c) = \max[x_1 + c, \ldots, x_n + c] - \min[x_1 + c, \ldots, x_n + c] =
\max[x_1, \ldots, x_n] + c - \min[x_1, \ldots, x_n] - c = t(x_1, \ldots, x_n)$, instead of
$t(x_1, \ldots, x_n) + c$.
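
The four checks above can also be carried out numerically. The following
Python sketch is an illustration only: the data values and the constant c are
hypothetical, and the helper names (midrange, sample_range, shift_effect) are
ours rather than notation from the text. It evaluates the difference between
the estimator at the shifted data and at the original data, which equals c for
a location-invariant estimator.

```python
from statistics import mean, variance


def midrange(xs):
    """Average of the smallest and largest observations, (y_1 + y_n)/2."""
    return (min(xs) + max(xs)) / 2


def sample_range(xs):
    """Largest minus smallest observation, y_n - y_1."""
    return max(xs) - min(xs)


def shift_effect(estimator, xs, c):
    """Return t(x_1 + c, ..., x_n + c) - t(x_1, ..., x_n).

    For a location-invariant estimator this difference equals c.
    """
    shifted = [x + c for x in xs]
    return estimator(shifted) - estimator(xs)


xs = [2, 7, 4, 9, 3]  # hypothetical measurements
c = 10                # hypothetical constant added to every measurement

for name, est in [("mean", mean), ("midrange", midrange),
                  ("variance", variance), ("range", sample_range)]:
    print(name, shift_effect(est, xs, c))
```

With these inputs the difference is 10 for the sample mean and the midrange,
and 0 for the sample variance and the range.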

Our use of location invariance will be similar to our use of unbiasedness. 
We will restrict ourselves to looking at location-invariant estimators and seek 
an estimator within the class of location-invariant estimators that has uniformly 
smallest mean-squared error. The property of location invariance is intuitively 
appealing and turns out also to be practically appealing if the parameter we are 
estimating represents location. 

Definition 24 Location parameter Let $\{f(\cdot\,; \theta), \theta \in \Theta\}$ be a family
of densities indexed by a parameter $\theta$, where $\Theta$ is the real line. The
parameter $\theta$ is defined to be a location parameter if and only if the density
$f(x; \theta)$ can be written as a function of $x - \theta$; that is, $f(x; \theta) = h(x - \theta)$
for some function $h(\cdot)$. Equivalently, $\theta$ is a location parameter for the
density $f_X(x; \theta)$ of a random variable $X$ if and only if the distribution of
$X - \theta$ does not depend on $\theta$. ////

We note that if $\theta$ is a location parameter for the family of densities
$\{f(\cdot\,; \theta), \theta \in \Theta\}$, then the function $h(\cdot)$ of the definition is a density function
given by $h(\cdot) = f(\cdot\,; 0)$.


EXAMPLE 37 We will give examples of several different location parameters.
If $f(x; \theta) = \phi_{\theta,1}(x)$, then $\theta$ is a location parameter since

$$\phi_{\theta,1}(x) = \frac{1}{\sqrt{2\pi}} \exp[-\tfrac{1}{2}(x - \theta)^2] = \phi_{0,1}(x - \theta).$$

Or if $X$ is distributed normally with mean $\theta$ and variance 1, then $X - \theta$
has a standard normal distribution; hence the distribution of $X - \theta$ is
independent of $\theta$.

If $f(x; \theta) = I_{(\theta - \frac{1}{2},\, \theta + \frac{1}{2})}(x)$, then $\theta$ is a location parameter since
$f(x; \theta) = I_{(\theta - \frac{1}{2},\, \theta + \frac{1}{2})}(x) = I_{(-\frac{1}{2},\, \frac{1}{2})}(x - \theta)$, a function of $x - \theta$.

If $f(x; \alpha) = 1/\{\pi[1 + (x - \alpha)^2]\}$, then $\alpha$ is a location parameter since
$f(x; \alpha)$ is a function of $x - \alpha$. ////

We will now state, without proof, a theorem that gives within the class
of location-invariant estimators the uniformly smallest mean-squared error
estimator of a location parameter. The theorem is from Pitman [41].

Theorem 11 Let $X_1, \ldots, X_n$ denote a random sample from the density
$f(\cdot\,; \theta)$, where $\theta$ is a location parameter and $\Theta$ is the real line. The
estimator

$$t(X_1, \ldots, X_n) = \frac{\int \theta \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}{\int \prod_{i=1}^{n} f(X_i; \theta)\, d\theta} \tag{17}$$

is the estimator of $\theta$ which has uniformly smallest mean-squared error
within the class of location-invariant estimators. ////

Definition 25 Pitman estimator for location The estimator given in 
Eq. (17) is defined to be the Pitman estimator for location. //// 

According to the formula given in Eq. (17), determining the Pitman
estimator requires evaluating the integrals given in the numerator and
denominator; such evaluation may not be easy. Note that the integration is with
respect to the parameter; so the resulting ratio will be a function of $X_1, \ldots, X_n$.
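
When the integrals in Eq. (17) are not available in closed form, they can be
approximated numerically. The Python sketch below is one simple way to do so
and is not part of Pitman's result: it assumes a midpoint rule on a finite
interval [lo, hi] chosen by the user, and the function names, the sample
values, and the integration limits are all hypothetical. The argument density
plays the role of $h(u) = f(u; 0)$, so that $f(x; \theta) = h(x - \theta)$.

```python
import math  # math.prod requires Python 3.8 or later


def pitman_location(xs, density, lo, hi, steps=20000):
    """Approximate the ratio in Eq. (17) by the midpoint rule on [lo, hi].

    density(u) is h(u) = f(u; 0). The interval [lo, hi] is an assumption of
    this sketch; it must cover the region where the likelihood is
    non-negligible.
    """
    width = (hi - lo) / steps
    num = 0.0
    den = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        like = math.prod(density(x - theta) for x in xs)
        num += theta * like
        den += like
    # The common factor `width` cancels in the ratio.
    return num / den


def std_normal(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def cauchy(u):
    """Cauchy density centered at 0."""
    return 1.0 / (math.pi * (1.0 + u * u))


xs = [1.2, -0.4, 0.9, 2.1, 0.3]  # hypothetical sample

print(pitman_location(xs, std_normal, -10, 12))         # close to the sample mean
print(pitman_location(xs, cauchy, -100, 100, steps=40000))
```

For the normal density the numerical value should agree with the sample mean
to within the quadrature error; for the Cauchy density no simple closed form
is claimed here, and the numerical ratio is the practical route.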


EXAMPLE 38 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution
with mean $\theta$ and variance unity. We saw in Example 37 that $\theta$ is a
location parameter. Our object is to find the Pitman estimator of $\theta$, which is
given by Eq. (17). In the following series of equalities one should be
forewarned that cancellations and insertions are being made simultaneously
in the numerator and denominator.

$$
\begin{aligned}
\frac{\int \theta \prod_{i=1}^{n} \phi_{\theta,1}(X_i)\, d\theta}{\int \prod_{i=1}^{n} \phi_{\theta,1}(X_i)\, d\theta}
&= \frac{\int \theta\, (1/\sqrt{2\pi})^n \exp[-\tfrac{1}{2} \sum (X_i - \theta)^2]\, d\theta}{\int (1/\sqrt{2\pi})^n \exp[-\tfrac{1}{2} \sum (X_i - \theta)^2]\, d\theta} \\
&= \frac{\int \theta \exp[-(n/2)\theta^2 + \theta \sum X_i]\, d\theta}{\int \exp[-(n/2)\theta^2 + \theta \sum X_i]\, d\theta} \\
&= \frac{\int \theta \exp[-(n/2)(\theta - \bar{X}_n)^2]\, d\theta}{\int \exp[-(n/2)(\theta - \bar{X}_n)^2]\, d\theta} \\
&= \frac{\int \theta\, [1/\sqrt{2\pi}(1/\sqrt{n})] \exp\{-\tfrac{1}{2}[(\theta - \bar{X}_n)/(1/\sqrt{n})]^2\}\, d\theta}{\int [1/\sqrt{2\pi}(1/\sqrt{n})] \exp\{-\tfrac{1}{2}[(\theta - \bar{X}_n)/(1/\sqrt{n})]^2\}\, d\theta} \\
&= \bar{X}_n
\end{aligned}
$$

by noting that the last denominator is just the integral of a normal density
with mean $\bar{X}_n$ and variance $1/n$ and hence is unity, and the last numerator
is the mean of this same normal density and hence is $\bar{X}_n$.

We note that, for this example, the Pitman estimator of $\theta$, which is
uniformly minimum mean-squared error among location-invariant
estimators, is identical to the UMVUE of $\theta$; that is, the estimator that
is best among location-invariant estimators is also best among unbiased
estimators. ////


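EXAMPLE 39 Let $X_1, \ldots, X_n$ be a random sample from a uniform
distribution over the interval $(\theta - \frac{1}{2}, \theta + \frac{1}{2})$. According to Example 37, $\theta$ is a
location parameter. The Pitman estimator of $\theta$ is

$$
\begin{aligned}
\frac{\int \theta \prod_{i=1}^{n} I_{(\theta - \frac{1}{2},\, \theta + \frac{1}{2})}(X_i)\, d\theta}{\int \prod_{i=1}^{n} I_{(\theta - \frac{1}{2},\, \theta + \frac{1}{2})}(X_i)\, d\theta}
&= \frac{\int \theta \prod_{i=1}^{n} I_{(X_i - \frac{1}{2},\, X_i + \frac{1}{2})}(\theta)\, d\theta}{\int \prod_{i=1}^{n} I_{(X_i - \frac{1}{2},\, X_i + \frac{1}{2})}(\theta)\, d\theta}
= \frac{\int_{Y_n - \frac{1}{2}}^{Y_1 + \frac{1}{2}} \theta\, d\theta}{\int_{Y_n - \frac{1}{2}}^{Y_1 + \frac{1}{2}} d\theta} \\
&= \frac{1}{2} \cdot \frac{(Y_1 + \frac{1}{2})^2 - (Y_n - \frac{1}{2})^2}{(Y_1 + \frac{1}{2}) - (Y_n - \frac{1}{2})}
= \frac{Y_1 + Y_n}{2}.
\end{aligned}
$$

Recall that for this example there is no UMVUE of $\theta$. (See Example 36.)
////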
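
Theorem 11 implies that, for this uniform family, the midrange
$(Y_1 + Y_n)/2$ has mean-squared error no larger than that of any other
location-invariant estimator, the sample mean included. The Python sketch
below checks this by simulation; the sample size, the value of $\theta$, the
number of replications, and the seed are hypothetical choices, and the
function name is ours. For reference, the exact mean-squared errors under
this model are $1/(12n)$ for the sample mean and $1/[2(n + 1)(n + 2)]$ for the
midrange.

```python
import random


def simulate_mse(n, theta, reps, seed=0):
    """Monte Carlo MSE of the sample mean and of the midrange.

    Samples are drawn from the uniform distribution on
    (theta - 1/2, theta + 1/2).
    """
    rng = random.Random(seed)
    se_mean = 0.0
    se_mid = 0.0
    for _ in range(reps):
        xs = [rng.uniform(theta - 0.5, theta + 0.5) for _ in range(n)]
        mean = sum(xs) / n
        mid = (min(xs) + max(xs)) / 2
        se_mean += (mean - theta) ** 2
        se_mid += (mid - theta) ** 2
    return se_mean / reps, se_mid / reps


n = 10
mse_mean, mse_mid = simulate_mse(n=n, theta=3.0, reps=20000)
print("sample mean:", mse_mean, "exact:", 1 / (12 * n))
print("midrange:   ", mse_mid, "exact:", 1 / (2 * (n + 1) * (n + 2)))
```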

Remark A Pitman estimator for location is a function of sufficient
statistics.

PROOF If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_k = s_k(X_1, \ldots, X_n)$ is a set of
sufficient statistics, then by the factorization criterion $\prod f(x_i; \theta) =
g(s_1, \ldots, s_k; \theta)h(x_1, \ldots, x_n)$; so the Pitman estimator can be written as

$$
\frac{\int \theta \prod f(X_i; \theta)\, d\theta}{\int \prod f(X_i; \theta)\, d\theta}
= \frac{\int \theta g(S_1, \ldots, S_k; \theta)h(X_1, \ldots, X_n)\, d\theta}{\int g(S_1, \ldots, S_k; \theta)h(X_1, \ldots, X_n)\, d\theta}
= \frac{\int \theta g(S_1, \ldots, S_k; \theta)\, d\theta}{\int g(S_1, \ldots, S_k; \theta)\, d\theta},
$$

which is a function of $S_1, \ldots, S_k$. ////

6.2 Scale Invariance 

For those experiments in which measurements can be made in different units, 
such as length being measured in either inches or centimeters, weight being 
measured in either pounds or kilograms, or volume being measured in either 
quarts or liters, one might reasonably require that his statistical procedure be 
independent of the measurement units employed. If the statistical procedure 
is that of point estimation, then one might require that the estimator that is to 
be used satisfy the property of scale invariance defined below. The idea is that 
an estimator will be scale-invariant if the estimator does not depend on the scale 
of the measurement. 

Definition 26 Scale-invariant An estimator $T = t(X_1, \ldots, X_n)$ is
defined to be scale-invariant if and only if $t(cx_1, \ldots, cx_n) = c\,t(x_1, \ldots, x_n)$
for all values $x_1, \ldots, x_n$ and all $c > 0$. ////

A number of the estimators that we have considered are scale-invariant,
including $\bar{X}_n$, $\sqrt{S^2}$, $(Y_1 + Y_n)/2$, and $Y_n - Y_1$. Our discussion of
scale-invariant estimators will be limited to problems concerning estimation of scale
parameters defined below.

Definition 27 Scale parameter Let $\{f(\cdot\,; \theta), \theta > 0\}$ be a family of
densities indexed by a real parameter $\theta$. The parameter $\theta$ is defined to be a
scale parameter if and only if the density $f(x; \theta)$ can be written as $(1/\theta)h(x/\theta)$
for some density $h(\cdot)$. Equivalently, $\theta$ is a scale parameter for the
density $f_X(x; \theta)$ of a random variable $X$ if and only if the distribution of
$X/\theta$ is independent of $\theta$. ////

Note that if $\theta$ is a scale parameter for the family of densities $\{f(\cdot\,; \theta), \theta > 0\}$,
then the density $h(\cdot)$ of the definition is given by $h(x) = f(x; 1)$.

EXAMPLE 40 We give several examples of scale parameters. If $f(x; \lambda) =
(1/\lambda)e^{-x/\lambda} I_{(0, \infty)}(x)$, then $\lambda$ is a scale parameter since $e^{-y} I_{(0, \infty)}(y)$ is a
density. Note that this parameterization of the negative exponential
distribution is not the parameterization that we have used previously.

If

$$f(x; \sigma) = \phi_{0,\sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right],$$

then $\sigma$ is a scale parameter since $(1/\sqrt{2\pi}) \exp(-\tfrac{1}{2}y^2)$ is a density.

If $f(x; \theta) = (1/\theta)I_{(0, \theta)}(x) = (1/\theta)I_{(0, 1)}(x/\theta)$, then $\theta$ is a scale
parameter since $I_{(0, 1)}(y)$ is a density.

If $f(x; \theta) = (1/\theta)I_{(\theta, 2\theta)}(x) = (1/\theta)I_{(1, 2)}(x/\theta)$, then $\theta$ is a scale
parameter since $I_{(1, 2)}(y)$ is a density. ////


Our sole result for scale invariance, a result that is comparable to the
result of Theorem 11 on location invariance, requires a slightly different
framework. Instead of measuring error with squared-error loss function we measure
it with the loss function $\ell(t; \theta) = (t - \theta)^2/\theta^2 = (t/\theta - 1)^2$. If $|t - \theta|$ represents
error, then $100|t - \theta|/\theta$ can be thought of as percent error, and then $(t - \theta)^2/\theta^2$
is proportional to percent error squared. We state the following theorem, also
from Pitman [41], without proof.

Theorem 12 Let $X_1, \ldots, X_n$ be a random sample from the density $f(\cdot\,; \theta)$,
where $\theta > 0$ is a scale parameter. Assume that $f(x; \theta) = 0$ for $x \le 0$;
that is, the random variables $X_i$ assume only positive values. Within
the class of scale-invariant estimators, the estimator

$$t(X_1, \ldots, X_n) = \frac{\int_0^\infty (1/\theta^2) \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}{\int_0^\infty (1/\theta^3) \prod_{i=1}^{n} f(X_i; \theta)\, d\theta} \tag{18}$$

has uniformly smallest risk for the loss function $\ell(t; \theta) = (t - \theta)^2/\theta^2$. ////

Definition 28 Pitman estimator for scale The estimator given in Eq. 
(18) is defined to be the Pitman estimator for scale. //// 

Remark The Pitman estimator for scale is a function of sufficient 
statistics. //// 




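EXAMPLE 41 Let $X_1, \ldots, X_n$ be a random sample from a density $f(x; \theta) =
(1/\theta)I_{(0, \theta)}(x)$. The Pitman estimator for the scale parameter $\theta$ is

$$
\begin{aligned}
\frac{\int_0^\infty (1/\theta^2) \prod_{i=1}^{n} (1/\theta)I_{(0, \theta)}(X_i)\, d\theta}{\int_0^\infty (1/\theta^3) \prod_{i=1}^{n} (1/\theta)I_{(0, \theta)}(X_i)\, d\theta}
&= \frac{\int_{Y_n}^\infty \theta^{-n-2}\, d\theta}{\int_{Y_n}^\infty \theta^{-n-3}\, d\theta} \\
&= \frac{\{1/[(n + 2) - 1]\}\,Y_n^{-(n+2)+1}}{\{1/[(n + 3) - 1]\}\,Y_n^{-(n+3)+1}}
= \frac{n + 2}{n + 1}\, Y_n.
\end{aligned}
$$

We know that $Y_n$ is a complete sufficient statistic and $\mathcal{E}[Y_n] = [n/(n + 1)]\theta$;
so by the Lehmann-Scheffé theorem $[(n + 1)/n]Y_n$ is the UMVUE of $\theta$.
////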


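EXAMPLE 42 Let $X_1, \ldots, X_n$ be a random sample from the density $f(x; \lambda) =
(1/\lambda) \exp(-x/\lambda) I_{(0, \infty)}(x)$. The Pitman estimator for the scale
parameter $\lambda$ is

$$
\begin{aligned}
\frac{\int_0^\infty (1/\lambda^2) \prod_{i=1}^{n} f(X_i; \lambda)\, d\lambda}{\int_0^\infty (1/\lambda^3) \prod_{i=1}^{n} f(X_i; \lambda)\, d\lambda}
&= \frac{\int_0^\infty (1/\lambda^{n+2}) \exp(-\sum X_i/\lambda)\, d\lambda}{\int_0^\infty (1/\lambda^{n+3}) \exp(-\sum X_i/\lambda)\, d\lambda} \\
&= \frac{\int_0^\infty (\alpha/\sum X_i)^{n+2} e^{-\alpha} (\sum X_i/\alpha^2)\, d\alpha}{\int_0^\infty (\alpha/\sum X_i)^{n+3} e^{-\alpha} (\sum X_i/\alpha^2)\, d\alpha} \\
&= \sum X_i\, \frac{\int_0^\infty \alpha^{n} e^{-\alpha}\, d\alpha}{\int_0^\infty \alpha^{n+1} e^{-\alpha}\, d\alpha}
= \sum X_i\, \frac{\Gamma(n + 1)}{\Gamma(n + 2)} = \frac{\sum X_i}{n + 1}.
\end{aligned}
$$

(It can be shown that the UMVUE of $\lambda$ is $\sum X_i/n$.)

Note that $\sum X_i/n$ is a scale-invariant estimator, and, hence, since
$\sum X_i/(n + 1)$ is the scale-invariant estimator having uniformly smallest
risk for the loss function $(t - \theta)^2/\theta^2$, the risk of $\sum X_i/(n + 1)$ is uniformly
smaller than the risk of $\sum X_i/n$. Also, since here risk equals $1/\theta^2$ times the
MSE, the MSE of $\sum X_i/(n + 1)$ is uniformly smaller than the MSE of
$\sum X_i/n$. ////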
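
The comparison in the preceding note can be made exact. For the density of
this example, $\sum X_i$ has mean $n\lambda$ and variance $n\lambda^2$, so an estimator of
the form $c \sum X_i$ has risk $c^2 n + (cn - 1)^2$ under the loss
$(t - \lambda)^2/\lambda^2$ (variance term plus squared-bias term, each divided by
$\lambda^2$). The Python sketch below evaluates this expression with exact
rational arithmetic; the function name and the choice $n = 5$ are ours and
serve only as an illustration.

```python
from fractions import Fraction


def relative_risk(c, n):
    """Risk of c * sum(X_i) under the loss (t - lam)^2 / lam^2.

    Assumes X_1, ..., X_n are exponential with mean lam, so that sum(X_i)
    has mean n*lam and variance n*lam^2. The result does not depend on lam.
    """
    c = Fraction(c)
    return c * c * n + (c * n - 1) ** 2


n = 5  # hypothetical sample size
print(relative_risk(Fraction(1, n), n))      # UMVUE, sum(X_i)/n
print(relative_risk(Fraction(1, n + 1), n))  # Pitman estimator, sum(X_i)/(n+1)
```

The two printed risks are $1/n$ and $1/(n + 1)$, here 1/5 and 1/6, in
agreement with the note above.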

7 BAYES ESTIMATORS

In our considerations of point-estimation problems in the previous sections of
this chapter, we have assumed that our random sample came from some density
$f(\cdot\,; \theta)$, where the function $f(\cdot\,; \cdot)$ was assumed known. Moreover, we have
assumed that $\theta$ was some fixed, though unknown to us, point. In some
real-world situations which the density $f(\cdot\,; \theta)$ represents, there is often additional
information about $\theta$ (the only assumption which we heretofore have made about
$\theta$ is that it can take on values in $\Theta$). For example, the experimenter may have
evidence that $\theta$ itself acts as a random variable for which he may be able to
postulate a realistic density function. For instance, suppose that a machine
which stamps out parts for automobiles is to be examined to see what fraction
$\theta$ of defectives is being made. On a certain day, 10 pieces of the machine's
output are examined, with the observations denoted by $X_1, X_2, \ldots, X_{10}$, where
$X_i = 1$ if the $i$th piece is defective and $X_i = 0$ if it is nondefective. These can
be viewed as a random sample of size 10 from the Bernoulli density

$$f(x; \theta) = \theta^x (1 - \theta)^{1 - x} I_{\{0, 1\}}(x) \quad \text{for } 0 < \theta < 1,$$

which indicates that the probability that a given part is defective is equal to the
unknown number $\theta$. The joint density of the 10 random variables $X_1, X_2, \ldots,
X_{10}$ is

$$\theta^{\sum x_i} (1 - \theta)^{10 - \sum x_i} \prod_{i=1}^{10} I_{\{0, 1\}}(x_i) \quad \text{for } 0 < \theta < 1.$$

The maximum-likelihood estimator of $\theta$, as explained in previous sections, is
$\hat{\Theta} = \bar{X}$. The method of moments gives the same estimator. Suppose,
however, that the experimenter has some additional information about $\theta$; suppose
that he has observed that on various days the value of $\theta$ changes and it appears
that the change can be represented as a random variable with the density

$$g_\Theta(\theta) = 6\theta(1 - \theta)I_{[0, 1]}(\theta).$$

An important question is: How can this additional information about $\theta$ be
used to estimate $\theta_0$, where $\theta_0$ is the value that $\Theta$ was equal to on the day the
sample was drawn?

To examine this problem, we will assume, in addition to the assumption
that our random sample came from a density $f(\cdot\,; \theta)$, that the unknown
parameter $\theta$ is the value of some random variable, say $\Theta$. We will still be interested
in estimating some function of $\theta$, say $\tau(\theta)$. If $\Theta$ is a random variable, it has
a distribution. We let $G(\cdot) = G_\Theta(\cdot)$ denote the cumulative distribution function
of $\Theta$ and $g(\cdot) = g_\Theta(\cdot)$ denote the density function of $\Theta$, and we assume these
functions contain no unknown parameters. In order to emphasize that the
distribution of $\Theta$ is over the parameter space, we have departed from our custom
of using $F(\cdot)$ and $f(\cdot)$ to represent a cumulative distribution function and
density function, respectively, and have used $G(\cdot)$ and $g(\cdot)$ instead.

If we assume that the distribution of $\Theta$ is known, we have additional
information. So an important question is: How can this additional information
be used in estimation? It is this question that we will address ourselves to in
the following two subsections. In many problems it may be unrealistic to
assume that $\theta$ is the value of a random variable; in other problems, even though
it seems reasonable to assume that $\theta$ is the value of a random variable $\Theta$, the
distribution of $\Theta$ may not be known, or even if it is known, it may contain other
unknown parameters. However, in some problems the assumption that the
distribution of $\Theta$ is known is realistic, and we shall examine this situation.

7.1 Posterior Distribution 

Heretofore we have used the notation $f(x; \theta)$ to indicate the density of a random
variable $X$ for each $\theta$ in $\Theta$. Whenever we want to indicate that the parameter
$\theta$ is the value of a random variable $\Theta$, we shall write the density of $X$ as $f(x \mid \theta)$
instead of $f(x; \theta)$. We should note that $f(x \mid \theta)$ is a conditional density; it is
the density of $X$ given $\Theta = \theta$. A more complete notation for $f(x \mid \theta)$ would be
$f_{X \mid \Theta = \theta}(x \mid \theta)$.

Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the density $f(\cdot \mid \theta)$,
where $\theta$ is the value of a random variable $\Theta$. Assume that the density of $\Theta$,
$g_\Theta(\cdot)$, is known and contains no unknown parameters, and suppose that we
want to estimate $\tau(\theta)$. How do we incorporate the additional information of
known $g_\Theta(\cdot)$ into our estimation procedures? In the past, we thought of the
likelihood function as a single expression that contained all our information; the
likelihood function included the observed sample $x_1, \ldots, x_n$ as well as the form
of the density $f(x; \theta)$ we sampled from in its expression. Now we need an
expression that contains all the information that the likelihood function
contains plus the added information of the known density $g_\Theta(\cdot)$. $g_\Theta(\cdot)$ is called the
prior distribution of $\Theta$. It summarizes what we know about $\theta$ prior to taking
a random sample. What we seek is an expression that summarizes what we
know about $\theta$ after we take a random sample. We seek the posterior
distribution of $\Theta$ given $X_1 = x_1, \ldots, X_n = x_n$.

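Definition 29 Prior and posterior distributions The density $g_\Theta(\cdot)$ is
called the prior distribution of $\Theta$. The conditional density of $\Theta$ given
$X_1 = x_1, \ldots, X_n = x_n$, denoted by $f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)$, is
called the posterior distribution of $\Theta$. ////

Remark

$$
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
= \frac{f_{X_1, \ldots, X_n \mid \Theta = \theta}(x_1, \ldots, x_n \mid \theta)\, g_\Theta(\theta)}{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}
= \frac{\left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}
$$

for random sampling. [Recall that $f_{Y \mid X = x}(y \mid x) = f_{X,Y}(x, y)/f_X(x) =
f_{X \mid Y = y}(x \mid y) f_Y(y)/f_X(x)$.] ////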


The posterior distribution replaces the likelihood function as an expression
that incorporates all information. If we want to estimate $\theta$ and parallel the
development of the maximum-likelihood estimator of $\theta$, we could take as our
estimator of $\theta$ that $\theta$ which maximizes the posterior distribution, that is, estimate
$\theta$ with the mode of the posterior distribution. However, unlike the likelihood
function (as a function of $\theta$), the posterior distribution is a distribution function;
so we could just as well estimate $\theta$ with the median or mean of the posterior
distribution. We will use the mean of the posterior distribution as our estimate
of $\theta$, and in general we could estimate $\tau(\theta)$ as the mean of $\tau(\Theta)$ given $X_1 = x_1,
\ldots, X_n = x_n$; that is, take $\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]$ as our estimate of $\tau(\theta)$.


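Definition 30 Posterior Bayes estimator Let $X_1, \ldots, X_n$ be a random
sample from a density $f(x \mid \theta)$, where $\theta$ is a value of the random variable
$\Theta$ with known density $g_\Theta(\cdot)$. The posterior Bayes estimator of $\tau(\theta)$ with
respect to the prior $g_\Theta(\cdot)$ is defined to be $\mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]$. ////

Remark

$$
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
= \frac{\int \tau(\theta) \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}. \tag{21}
$$

////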

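One might note the similarity between the posterior Bayes estimator of
$\tau(\theta) = \theta$ and the Pitman estimator of a location parameter [see Eq. (17)].

EXAMPLE 43 Let $X_1, \ldots, X_n$ denote a random sample from the Bernoulli
density $f(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$ for $x = 0, 1$. Assume that the prior
distribution of $\Theta$ is given by $g_\Theta(\theta) = I_{(0, 1)}(\theta)$; that is, $\Theta$ is uniformly
distributed over the interval (0, 1). Consider estimating $\theta$ and $\tau(\theta) = \theta(1 - \theta)$.
Now

$$
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
= \frac{\theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} I_{(0, 1)}(\theta)}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta};
$$

so the posterior Bayes estimator of $\theta$ with respect to the uniform prior
distribution is given by

$$
\begin{aligned}
\mathcal{E}[\Theta \mid X_1 = x_1, \ldots, X_n = x_n]
&= \int \theta f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\int_0^1 \theta\, \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}
= \frac{B(\sum x_i + 2,\; n - \sum x_i + 1)}{B(\sum x_i + 1,\; n - \sum x_i + 1)} \\
&= \frac{\Gamma(\sum x_i + 2)\Gamma(n - \sum x_i + 1)}{\Gamma(n + 3)} \cdot \frac{\Gamma(n + 2)}{\Gamma(\sum x_i + 1)\Gamma(n - \sum x_i + 1)} \\
&= \frac{\sum x_i + 1}{n + 2}.
\end{aligned}
$$

Hence the posterior Bayes estimator of $\theta$ with respect to the uniform prior
distribution is given by $(\sum X_i + 1)/(n + 2)$. Contrast this to the
maximum-likelihood estimator of $\theta$, which is $\sum X_i/n$. $\sum X_i/n$ is unbiased
and an UMVUE, whereas the posterior Bayes estimator is not unbiased.

To obtain the posterior Bayes estimator of, say, $\tau(\theta) = \theta(1 - \theta)$, we
calculate

$$
\begin{aligned}
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
&= \int \theta(1 - \theta) f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\int_0^1 \theta(1 - \theta)\, \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta} \\
&= \frac{\Gamma(\sum x_i + 2)\Gamma(n - \sum x_i + 2)}{\Gamma(n + 4)} \cdot \frac{\Gamma(n + 2)}{\Gamma(\sum x_i + 1)\Gamma(n - \sum x_i + 1)} \\
&= \frac{(\sum x_i + 1)(n - \sum x_i + 1)}{(n + 3)(n + 2)}.
\end{aligned}
$$

So the posterior Bayes estimator of $\theta(1 - \theta)$ with respect to a uniform
prior distribution is $(\sum X_i + 1)(n - \sum X_i + 1)/(n + 3)(n + 2)$. ////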
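
The two closed forms of Example 43 can be checked with exact arithmetic.
The Python sketch below is an illustration under the assumptions of the
example (Bernoulli sampling, uniform prior); the function names are ours. It
uses the fact that, for nonnegative integers $a$ and $b$, the integral of
$\theta^a (1 - \theta)^b$ over (0, 1) equals $a!\,b!/(a + b + 1)!$, and writes
$s$ for the observed value of $\sum x_i$.

```python
from fractions import Fraction
from math import factorial


def beta_int(a, b):
    """Integral over (0, 1) of theta^a * (1 - theta)^b for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def posterior_mean_theta(s, n):
    """Posterior mean of Theta given sum(x_i) = s, uniform prior."""
    return beta_int(s + 1, n - s) / beta_int(s, n - s)


def posterior_mean_theta_times_one_minus_theta(s, n):
    """Posterior mean of Theta * (1 - Theta) given sum(x_i) = s, uniform prior."""
    return beta_int(s + 1, n - s + 1) / beta_int(s, n - s)


n, s = 10, 3  # hypothetical sample size and number of ones
print(posterior_mean_theta(s, n))                        # (s + 1)/(n + 2)
print(posterior_mean_theta_times_one_minus_theta(s, n))  # (s + 1)(n - s + 1)/((n + 3)(n + 2))
```

For $n = 10$ and $s = 3$ the two values are 1/3 and 8/39.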
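We noted in the above example that the posterior Bayes estimator that
we obtained was not unbiased. The following remark states that in general a
posterior Bayes estimator is not unbiased.

Remark Let $T_G^* = t_G^*(X_1, \ldots, X_n)$ denote the posterior Bayes estimator
of $\tau(\theta)$ with respect to a prior distribution $G(\cdot)$. If both $T_G^*$ and $\tau(\Theta)$
have finite variance, then either $\operatorname{var}[T_G^* \mid \theta] = 0$, or $T_G^*$ is not an unbiased
estimator of $\tau(\theta)$. That is, either $T_G^*$ estimates $\tau(\theta)$ correctly with
probability 1, or $T_G^*$ is not an unbiased estimator of $\tau(\theta)$.

PROOF Let us suppose that $T_G^*$ is an unbiased estimator of $\tau(\theta)$;
that is, $\mathcal{E}[T_G^* \mid \theta] = \tau(\theta)$. By definition we have $T_G^* = t_G^*(X_1, \ldots, X_n) =
\mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]$. Now

$$
\begin{aligned}
\operatorname{var}[T_G^*] &= \mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \operatorname{var}[\mathcal{E}[T_G^* \mid \Theta]] \\
&= \mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \operatorname{var}[\tau(\Theta)],
\end{aligned}
$$

and

$$
\begin{aligned}
\operatorname{var}[\tau(\Theta)] &= \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] + \operatorname{var}[\mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]] \\
&= \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] + \operatorname{var}[T_G^*];
\end{aligned}
$$

hence, $\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] = 0$. Since both
$\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]]$ and $\mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]]$ are nonnegative and their
sum is 0, both are 0. In particular, $\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] = 0$, and since
$\operatorname{var}[T_G^* \mid \Theta]$ is nonnegative and has zero expectation, $\operatorname{var}[T_G^* \mid \theta] = 0$. ////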


7.2 Loss-function Approach 

In Subsec. 3.4 we introduced the concepts of loss and risk. These two concepts
were used to assess goodness of estimators. In this section we discuss how the
additional information of knowledge of a prior distribution of $\Theta$ can be used
in conjunction with loss and risk to define or select an optimum estimator.

We commence with a review of the problem we hope to solve. Let
$X_1, \ldots, X_n$ be a random sample from a density $f(x \mid \theta)$, $\theta$ belonging to $\Theta$, where
the function $f(\cdot \mid \theta)$ is assumed known except for $\theta$. We assume that the
unknown $\theta$ is the value of some random variable $\Theta$ and that the distribution of
$\Theta$ is known and contains no unknown parameters. On the basis of the random
sample $X_1, \ldots, X_n$ we hope to estimate $\tau(\theta)$, some function of $\theta$. In addition,
we assume that a loss function $\ell(t; \theta)$ has been specified, where $\ell(t; \theta)$ represents
the loss incurred if we estimate $\tau(\theta)$ to be $t$ when $\theta$ is the parameter of the
density from which we sampled. For any estimator $T = t(X_1, \ldots, X_n)$, we
noted in Subsec. 3.4 that $\mathcal{E}_\theta[\ell(T; \theta)]$ represented the average loss of that
estimator, and we defined this average loss to be the risk, denoted by $\mathcal{R}_t(\theta)$, of the
estimator $t(\cdot, \ldots, \cdot)$. We further noted that two estimators, say $T_1 =
t_1(X_1, \ldots, X_n)$ and $T_2 = t_2(X_1, \ldots, X_n)$, could be compared by looking at their
respective risks $\mathcal{R}_{t_1}(\theta)$ and $\mathcal{R}_{t_2}(\theta)$, preference being given to that estimator with
smaller risk. In general, the risk functions as functions of $\theta$ of two estimators
may cross, one risk function being smaller for some $\theta$ and the other smaller
for other $\theta$. Then, since $\theta$ is unknown, it is difficult to make a choice between
the two estimators. The difficulty is caused by the dependence of the risk
function on $\theta$. Now, since we have assumed that $\theta$ is the value of some random
variable $\Theta$, the distribution of which is also assumed known, we have a natural
way of removing the dependence of the risk function on $\theta$, namely, by averaging
out the $\theta$, using the density of $\Theta$ as our weight function.

Definition 31 Bayes risk Let $X_1, \ldots, X_n$ be a random sample from a
density $f(x \mid \theta)$, where $\theta$ is the value of a random variable $\Theta$ with
cumulative distribution function $G(\cdot) = G_\Theta(\cdot)$ and corresponding density
$g(\cdot) = g_\Theta(\cdot)$. In estimating $\tau(\theta)$, let $\ell(t; \theta)$ be the loss function. The
risk of estimator $T = t(X_1, \ldots, X_n)$ is denoted by $\mathcal{R}_t(\theta)$. The Bayes risk
of estimator $T = t(X_1, \ldots, X_n)$ with respect to the loss function $\ell(\cdot\,; \cdot)$
and prior cumulative distribution $G(\cdot)$, denoted by $r(t) = r_{\ell,G}(t)$, is
defined to be

$$r(t) = r_{\ell,G}(t) = \int_\Theta \mathcal{R}_t(\theta) g(\theta)\, d\theta. \tag{22}$$

////

The Bayes risk of an estimator is an average risk, the averaging being over
the parameter space $\Theta$ with respect to the prior density $g(\cdot)$. For given
loss function $\ell(\cdot\,; \cdot)$ and prior density $g(\cdot)$ the Bayes risk of an estimator is a
real number; so now two competing estimators can be readily compared by
comparing their respective Bayes risks, still preferring that estimator with smaller
Bayes risk. In fact, we can now define the "best" estimator of $\tau(\theta)$ to be that
estimator with smallest Bayes risk.
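
To make Eq. (22) concrete, consider again Bernoulli sampling with a uniform
prior and squared-error loss, and compare the estimators $\sum X_i/n$ and
$(\sum X_i + 1)/(n + 2)$ of Example 43. The Python sketch below is an
illustration only; the function names and the choice $n = 10$ are ours. It
computes the Bayes risk exactly by summing over the possible values $s$ of
$\sum X_i$ and integrating each term in $\theta$ with the identity
$\int_0^1 \theta^a (1 - \theta)^b\, d\theta = a!\,b!/(a + b + 1)!$.

```python
from fractions import Fraction
from math import comb, factorial  # math.comb requires Python 3.8 or later


def beta_int(a, b):
    """Integral over (0, 1) of theta^a * (1 - theta)^b for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def bayes_risk(t, n):
    """Bayes risk of the estimator t(s) of theta, where s = sum(x_i).

    Assumes Bernoulli sampling, squared-error loss, and a uniform prior.
    Each term integrates C(n, s) theta^s (1 - theta)^(n - s) (t(s) - theta)^2
    over (0, 1) after expanding the square.
    """
    total = Fraction(0)
    for s in range(n + 1):
        ts = Fraction(t(s))
        total += comb(n, s) * (ts * ts * beta_int(s, n - s)
                               - 2 * ts * beta_int(s + 1, n - s)
                               + beta_int(s + 2, n - s))
    return total


n = 10  # hypothetical sample size


def mle(s):
    return Fraction(s, n)


def posterior_bayes(s):
    return Fraction(s + 1, n + 2)


print(bayes_risk(mle, n))              # equals 1/(6n)
print(bayes_risk(posterior_bayes, n))  # equals 1/(6(n + 2))
```

For $n = 10$ the two Bayes risks are 1/60 and 1/72, so under this prior and
loss the second estimator is preferred.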

Definition 32 Bayes estimator The Bayes estimator of $\tau(\theta)$, denoted by
$T_{\ell,G}^* = t_{\ell,G}^*(X_1, \ldots, X_n)$, with respect to the loss function $\ell(\cdot\,; \cdot)$ and
prior cumulative distribution $G(\cdot)$, is defined to be that estimator with
smallest Bayes risk. Or the Bayes estimator of $\tau(\theta)$ is that estimator
$t_{\ell,G}^*$ satisfying

$$r_{\ell,G}(t^*) = r_{\ell,G}(t_{\ell,G}^*) \le r_{\ell,G}(t)$$

for every other estimator $T = t(X_1, \ldots, X_n)$ of $\tau(\theta)$. ////

The posterior Bayes estimator of $\tau(\theta)$, defined in Definition 30, was defined
without regard to a loss function, whereas the definition given above requires
specification of a loss function.

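The definition leaves the problem of actually finding the Bayes estimator,
which may not be easy for an arbitrary loss function, unsolved. However, for
squared-error loss, finding the Bayes estimator is relatively easy. We seek that
estimator, say $t^*(\cdot, \ldots, \cdot)$, which minimizes the expression
$\int_\Theta \mathcal{R}_t(\theta) g(\theta)\, d\theta = \int_\Theta \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2] g(\theta)\, d\theta$
as a function over possible estimators $t(\cdot, \ldots, \cdot)$. Now,

$$
\begin{aligned}
&\int_\Theta \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2] g(\theta)\, d\theta \\
&\quad= \int_\Theta \left\{ \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)]^2 f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) \prod_{i=1}^{n} dx_i \right\} g(\theta)\, d\theta \\
&\quad= \int \cdots \int \left\{ \int_\Theta [\tau(\theta) - t(x_1, \ldots, x_n)]^2\, \frac{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) g(\theta)}{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}\, d\theta \right\} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i \\
&\quad= \int \cdots \int \left\{ \int_\Theta [\tau(\theta) - t(x_1, \ldots, x_n)]^2 f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \right\} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i,
\end{aligned}
$$

and since the integrand is nonnegative, the double integral can be minimized
if the expression within the braces is minimized for each $x_1, \ldots, x_n$. But the
expression within the braces is the conditional expectation of $[\tau(\Theta) - t(x_1,
\ldots, x_n)]^2$ with respect to the posterior distribution of $\Theta$ given $X_1 = x_1, \ldots,
X_n = x_n$, which is minimized as a function of $t(x_1, \ldots, x_n)$ for $t^*(x_1, \ldots, x_n)$
equal to the conditional expectation of $\tau(\Theta)$ with respect to the posterior
distribution of $\Theta$ given $X_1 = x_1, \ldots, X_n = x_n$. {Recall that $\mathcal{E}[(Z - a)^2]$ is minimized
as a function of $a$ for $a^* = \mathcal{E}[Z]$.} Hence the Bayes estimator of $\tau(\theta)$ with
respect to the squared-error loss function is given by

$$
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
= \frac{\int \tau(\theta) \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta}, \tag{23}
$$

which is identical to the estimator given in Eq. (21).

For a general loss function, we seek that estimator which minimizes
$\int_\Theta \mathcal{R}_t(\theta) g(\theta)\, d\theta$. Again,

$$
\begin{aligned}
\int_\Theta \mathcal{R}_t(\theta) g(\theta)\, d\theta
&= \int_\Theta \left[ \int \cdots \int \ell(t(x_1, \ldots, x_n); \theta) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) \prod_{i=1}^{n} dx_i \right] g(\theta)\, d\theta \\
&= \int \cdots \int \left[ \int_\Theta \ell(t(x_1, \ldots, x_n); \theta) f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \right] f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i,
\end{aligned}
$$

and minimizing the double integral is equivalent to minimizing the expression
within the brackets, which is sometimes called the posterior risk. So, in general,
the Bayes estimator of $\tau(\theta)$ with respect to the loss function $\ell(\cdot\,; \cdot)$ and prior
density $g(\cdot)$ is that estimator which minimizes the posterior risk, which is the
expected loss with respect to the posterior distribution of $\Theta$ given the
observations $x_1, \ldots, x_n$. We have the following theorem and corollaries.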


Theorem 13 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x \mid \theta)$, and let $g(\theta)$ be the density of $\Theta$. Further let $\ell(t; \theta)$ be the loss
function for estimating $\tau(\theta)$. The Bayes estimator of $\tau(\theta)$ is that estimator
$t^*(\cdot, \ldots, \cdot)$ which minimizes

$$\int_\Theta \ell(t(x_1, \ldots, x_n); \theta) f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \tag{24}$$

as a function of $t(\cdot, \ldots, \cdot)$. ////


Corollary Under the assumptions of Theorem 13, the Bayes estimator
of $\tau(\theta)$ is given by

$$
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
= \frac{\int \tau(\theta) \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta} \tag{25}
$$

for a squared-error loss function. ////


Corollary Under the assumptions of Theorem 13, the Bayes estimator
of $\theta$ is given by the median of the posterior distribution of $\Theta$ for a loss
function equal to the absolute deviation. ////

The proofs of the theorem and first corollary preceded the statement of the
theorem. The second corollary follows from the observation that

$$\int_\Theta |\theta - t(x_1, \ldots, x_n)|\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta$$

is minimized as a function of $t(\cdot, \ldots, \cdot)$ for $t^*(\cdot, \ldots, \cdot)$ equal to the median
of the posterior distribution of $\Theta$. {Recall that $\mathcal{E}[|Z - a|]$ is minimized as a
function of $a$ for $a^*$ = median of $Z$.}
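
The two facts recalled in braces, that $\mathcal{E}[(Z - a)^2]$ is minimized at the
mean and $\mathcal{E}[|Z - a|]$ at the median, can be seen on a small discrete
distribution. The Python sketch below is only an illustration: the support
points, the probabilities, and the search grid are hypothetical, and a grid
search stands in for the analytical argument.

```python
def expected_loss(a, values, probs, loss):
    """Expected value of loss(Z - a) for a discrete random variable Z."""
    return sum(p * loss(z - a) for z, p in zip(values, probs))


values = [0.0, 1.0, 2.0, 10.0]  # hypothetical support of Z
probs = [0.3, 0.3, 0.3, 0.1]    # hypothetical probabilities (sum to 1)

grid = [k / 100 for k in range(0, 1001)]  # candidate values of a in [0, 10]

best_squared = min(
    grid, key=lambda a: expected_loss(a, values, probs, lambda d: d * d))
best_absolute = min(
    grid, key=lambda a: expected_loss(a, values, probs, abs))

mean_z = sum(p * z for z, p in zip(values, probs))
print(best_squared, mean_z)  # grid minimizer of squared loss vs. the mean
print(best_absolute)         # grid minimizer of absolute loss (the median, 1.0)
```

Here the mean of $Z$ is 1.9 and the median is 1, and the two grid minimizers
land on those values; the outlying support point 10 pulls the mean but not
the median.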

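EXAMPLE 44 Let $X_1, \ldots, X_n$ be a random sample from the normal density
with mean $\theta$ and variance 1. Consider estimating $\theta$ with a squared-error
loss function. Assume that $\Theta$ has a normal density with mean $\mu_0$ and
variance 1. Write $\mu_0 = x_0$ when convenient. According to Eq. (25) the
Bayes estimator is given as the mean of the posterior distribution of $\Theta$.

$$
\begin{aligned}
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
&= \frac{f_{X_1, \ldots, X_n \mid \Theta = \theta}(x_1, \ldots, x_n \mid \theta) g(\theta)}{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}
= \frac{\left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta} \\
&= \frac{(1/\sqrt{2\pi})^n \exp\left[-\tfrac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2\right] (1/\sqrt{2\pi}) \exp[-\tfrac{1}{2}(\theta - \mu_0)^2]}{\int_{-\infty}^{\infty} (1/\sqrt{2\pi})^n \exp\left[-\tfrac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2\right] (1/\sqrt{2\pi}) \exp[-\tfrac{1}{2}(\theta - \mu_0)^2]\, d\theta} \\
&= \frac{\exp\left[-\tfrac{1}{2} \sum_{i=0}^{n} (x_i - \theta)^2\right]}{\int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2} \sum_{i=0}^{n} (x_i - \theta)^2\right] d\theta} \\
&= \frac{\exp\left\{-\tfrac{1}{2}\left[(n + 1)\theta^2 - 2\theta \sum_{i=0}^{n} x_i + \sum_{i=0}^{n} x_i^2\right]\right\}}{\int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2}\left[(n + 1)\theta^2 - 2\theta \sum_{i=0}^{n} x_i + \sum_{i=0}^{n} x_i^2\right]\right\} d\theta} \\
&= \frac{\exp\left(-[(n + 1)/2]\left\{\theta^2 - 2\theta \sum_{0}^{n} x_i/(n + 1) + \left[\sum_{0}^{n} x_i/(n + 1)\right]^2\right\}\right)}{\int_{-\infty}^{\infty} \exp\left(-[(n + 1)/2]\left\{\theta^2 - 2\theta \sum_{0}^{n} x_i/(n + 1) + \left[\sum_{0}^{n} x_i/(n + 1)\right]^2\right\}\right) d\theta} \\
&= \frac{[1/\sqrt{2\pi/(n + 1)}] \exp\left\{-[(n + 1)/2]\left[\theta - \sum_{0}^{n} x_i/(n + 1)\right]^2\right\}}{\int_{-\infty}^{\infty} [1/\sqrt{2\pi/(n + 1)}] \exp\left\{-[(n + 1)/2]\left[\theta - \sum_{0}^{n} x_i/(n + 1)\right]^2\right\} d\theta} \\
&= \frac{1}{\sqrt{2\pi/(n + 1)}} \exp\left\{-\frac{n + 1}{2}\left[\theta - \frac{\sum_{0}^{n} x_i}{n + 1}\right]^2\right\};
\end{aligned}
$$

the denominator is unity since it is the integral of a density. We have
shown that the posterior distribution of $\Theta$ is normal with mean
$\sum_{0}^{n} x_i/(n + 1)$ and variance $1/(n + 1)$; hence the Bayes estimator of $\theta$ with
respect to squared-error loss is

$$\frac{x_0 + \sum_{1}^{n} X_i}{n + 1} = \frac{\mu_0 + \sum_{1}^{n} X_i}{n + 1}.$$

Since the posterior distribution of $\Theta$ is normal, its mean and median are
the same; hence

$$\frac{\mu_0 + \sum_{1}^{n} X_i}{n + 1}$$

is also the Bayes estimator with respect to a loss function equal to the
absolute deviation. ////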
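
The conclusion of Example 44 can be checked numerically without completing
the square. The Python sketch below is an illustration under stated
assumptions: it approximates the posterior mean and variance by a midpoint
rule on a finite interval, and the sample values, the prior mean, the
integration limits, and the function name are all hypothetical.

```python
import math


def posterior_moments(xs, mu0, lo=-20.0, hi=20.0, steps=40000):
    """Approximate the posterior mean and variance of Theta on [lo, hi].

    Assumes X_i given theta is normal(theta, 1) and Theta is normal(mu0, 1).
    Normalizing constants cancel, so only the exponent is evaluated.
    """
    width = (hi - lo) / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        log_w = (-0.5 * sum((x - theta) ** 2 for x in xs)
                 - 0.5 * (theta - mu0) ** 2)
        w = math.exp(log_w)
        m0 += w
        m1 += theta * w
        m2 += theta * theta * w
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean


xs = [1.2, -0.4, 0.9, 2.1, 0.3]  # hypothetical sample
mu0 = 0.5                        # hypothetical prior mean
n = len(xs)

mean, var = posterior_moments(xs, mu0)
print(mean, (mu0 + sum(xs)) / (n + 1))  # numerical vs. closed-form posterior mean
print(var, 1 / (n + 1))                 # numerical vs. closed-form posterior variance
```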


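EXAMPLE 45 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x \mid \theta) = (1/\theta) I_{(0,\theta)}(x)$. Estimate $\theta$ with the loss function
$\ell(t; \theta) = (t - \theta)^2/\theta^2$. Assume that $\Theta$ has a density given by
$g(\theta) = I_{(0,1)}(\theta)$. Let $y_n$ denote $\max[x_1, \ldots, x_n]$. Find the
posterior distribution of $\Theta$.

$$
\begin{aligned}
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
&= \frac{(1/\theta)^n \prod_{i=1}^{n} I_{(0,\theta)}(x_i)\, I_{(0,1)}(\theta)}
        {\int_0^1 (1/\theta)^n \prod_{i=1}^{n} I_{(0,\theta)}(x_i)\, d\theta} \\[4pt]
&= \frac{(1/\theta)^n I_{(y_n,1)}(\theta)}{\int_{y_n}^{1} (1/\theta)^n\, d\theta} \\[4pt]
&= \frac{(1/\theta)^n I_{(y_n,1)}(\theta)}{[1/(n-1)](1/y_n^{\,n-1} - 1)}.
\end{aligned}
$$

We seek that estimator which minimizes Eq. (24), or we seek that estimator
$t(\cdot)$ which minimizes

$$\frac{\int \{[t(y_n) - \theta]^2/\theta^2\}(1/\theta)^n I_{(y_n,1)}(\theta)\, d\theta}{[1/(n-1)](1/y_n^{\,n-1} - 1)}$$

or that estimator which minimizes

$$\int_{y_n}^{1} [t(y_n) - \theta]^2 (1/\theta^{n+2})\, d\theta. \qquad (26)$$

Equation (26) is a quadratic equation in $t(\cdot)$; this quadratic equation
assumes its minimum for

$$t = \frac{\int_{y_n}^{1} (1/\theta^{n+1})\, d\theta}{\int_{y_n}^{1} (1/\theta^{n+2})\, d\theta}
    = \frac{[1/(-n)](1 - 1/y_n^{\,n})}{[1/(-n-1)](1 - 1/y_n^{\,n+1})}
    = \frac{n+1}{n}\, y_n\, \frac{1 - y_n^{\,n}}{1 - y_n^{\,n+1}}.$$

////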
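A small numerical sketch may help confirm the final simplification in Example 45. The function names and the test values of $y_n$ and $n$ are illustrative assumptions. The code evaluates the closed form $[(n+1)/n]\, y_n (1 - y_n^n)/(1 - y_n^{n+1})$ and compares it with the ratio of the two integrals computed by the midpoint rule.

```python
def bayes_estimate_closed_form(y_n, n):
    """((n+1)/n) * y_n * (1 - y_n**n) / (1 - y_n**(n+1)), valid for 0 < y_n < 1."""
    return (n + 1) / n * y_n * (1 - y_n ** n) / (1 - y_n ** (n + 1))

def bayes_estimate_numeric(y_n, n, steps=100_000):
    """Ratio of the integrals of theta**-(n+1) and theta**-(n+2) over (y_n, 1)."""
    h = (1.0 - y_n) / steps
    num = den = 0.0
    for k in range(steps):
        theta = y_n + (k + 0.5) * h   # midpoint of the k-th subinterval
        num += theta ** -(n + 1)
        den += theta ** -(n + 2)
    return num / den                  # the step width h cancels in the ratio

print(bayes_estimate_closed_form(0.8, 5), bayes_estimate_numeric(0.8, 5))
```

Note that the estimate always lies between $y_n$ and 1, which is where the posterior places all of its mass.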


We note that the Bayes estimators derived in Examples 43 to 45 are 
functions of sufficient statistics. It can be shown that this is generally true; 
that is, a Bayes estimator is always a function of minimal sufficient statistics. In 
fact, under quite general conditions it can be shown that the Bayes estimator 
corresponding to an arbitrary prior probability density function, which is 
positive for all $\theta$ belonging to $\overline{\Theta}$, is consistent and BAN. So, even if you do not
know the correct prior distribution, a Bayes estimator has some desirable 
optimum properties. And if you do know the correct prior distribution and 
accept the criterion that a best estimator is one that minimizes average loss, then 
the Bayes estimator corresponding to the known prior distribution is optimum. 

Even in those problems when the prior distribution is unknown, the 
concept of Bayes estimation can benefit us. It provides us with a technique of 
determining many estimators that we might not have otherwise considered. 
Each possible prior distribution has a corresponding estimator, whose merits 
can be judged by using our standard methods of comparison. Thus, we have 
yet another method of finding estimators to append to the methods given in 
Sec. 2. 

Bayes estimation can also sometimes be useful as a tool in obtaining an 
estimator possessing some desirable property that does not depend on prior 
distribution information. The property of minimax is such a property, and 
in the next subsection we will see how Bayes estimation can sometimes be used 
to find a minimax estimator. Another such property is given below. Our 
objective has been to minimize risk, but since risk depended on the parameter, 
we were unable to find one estimator that had smaller risk than all others for 
all parameter values. Minimax circumvented such difficulty by replacing the 
risk function by its maximum value and then seeking that estimator which 
minimized such maximum value. Another way of getting around the difficulty 
arising from attempting to uniformly minimize risk is to replace the risk function 
by the area under the risk function and to seek that estimator which has the
least area under its risk function. We note that if the parameter space $\overline{\Theta}$ is
an interval, the estimator having the least area under its risk function is the
Bayes estimator corresponding to a uniform prior distribution over the interval $\overline{\Theta}$.
This is true because for a uniform prior distribution the Bayes risk is proportional
to the area under the risk function, and hence minimizing the Bayes risk is
equivalent to minimizing area.

7.3 Minimax Estimator 

We defined a minimax estimator at the end of Subsec. 3.4 as an estimator whose 
maximum risk is less than or equal to the maximum risk of any other estimator. 
Such an estimator might be considered “conservative” since it protects against 
the worst that can happen; it seeks to minimize the maximum risk. The following
theorem is sometimes useful in finding a minimax estimator.


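Theorem 14 If $T^* = t^*(X_1, \ldots, X_n)$ is a Bayes estimator having constant
risk, that is, $\mathcal{R}_{t^*}(\theta) \equiv$ constant, then $T^*$ is a minimax estimator.


proof Let $g^*(\cdot)$ be the prior density corresponding to the Bayes
estimator $t^*(\cdot, \ldots, \cdot)$.

$$
\sup_{\theta \in \overline{\Theta}} \mathcal{R}_{t^*}(\theta) = \text{constant} = \mathcal{R}_{t^*}(\theta)
= \int_{\overline{\Theta}} \mathcal{R}_{t^*}(\theta) g^*(\theta)\, d\theta
\le \int_{\overline{\Theta}} \mathcal{R}_{t}(\theta) g^*(\theta)\, d\theta
\le \sup_{\theta \in \overline{\Theta}} \mathcal{R}_{t}(\theta)
$$

for any other estimator $t(\cdot, \ldots, \cdot)$.

////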


EXAMPLE 46 Find the minimax estimator of $\theta$ in sampling from the Bernoulli
distribution using a squared-error loss function. We seek a Bayes estimator
with constant risk. The family of beta distributions is a family of
possible prior distributions. We hope that for one of the beta prior
distributions the corresponding Bayes estimator will have constant risk.
A Bayes estimator is given by

$$
\begin{aligned}
&\frac{\int_0^1 \theta\, \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}[1/B(a,b)]\theta^{a-1}(1-\theta)^{b-1}\, d\theta}
      {\int_0^1 \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}[1/B(a,b)]\theta^{a-1}(1-\theta)^{b-1}\, d\theta} \\[4pt]
&\qquad= \frac{\int_0^1 \theta^{\sum x_i + a}(1-\theta)^{n-\sum x_i+b-1}\, d\theta}
             {\int_0^1 \theta^{\sum x_i + a - 1}(1-\theta)^{n-\sum x_i+b-1}\, d\theta} \\[4pt]
&\qquad= \frac{B(\sum x_i + a + 1,\; n - \sum x_i + b)}{B(\sum x_i + a,\; n - \sum x_i + b)}
        = \frac{\sum x_i + a}{n + a + b}.
\end{aligned}
$$

So the Bayes estimator with respect to a beta prior distribution having
parameters $a$ and $b$ is given by

$$\frac{\sum X_i + a}{n + a + b}.$$

We now evaluate the risk of $(\sum X_i + a)/(n + a + b)$ with the hope that we
will be able to select $a$ and $b$ so that the risk will be constant. Write
$t^*_{a,b}(x_1, \ldots, x_n) = A \sum x_i + B = (\sum x_i + a)/(n + a + b)$; then

$$
\begin{aligned}
\mathcal{R}_{t^*_{a,b}}(\theta) &= \mathcal{E}_\theta[(A \textstyle\sum X_i + B - \theta)^2]
 = \mathcal{E}_\theta[[A(\textstyle\sum X_i - n\theta) + B - \theta + nA\theta]^2] \\
&= A^2 \mathcal{E}_\theta[(\textstyle\sum X_i - n\theta)^2] + (B - \theta + nA\theta)^2
 = nA^2\theta(1-\theta) + (B - \theta + nA\theta)^2 \\
&= \theta^2[(nA-1)^2 - nA^2] + \theta[nA^2 + 2(nA-1)B] + B^2,
\end{aligned}
$$

which is constant if $(nA-1)^2 - nA^2 = 0$ and $nA^2 + 2(nA-1)B = 0$. Now
$(nA-1)^2 - nA^2 = 0$ if $A = 1/[\sqrt{n}(\sqrt{n}+1)]$, and $nA^2 + 2(nA-1)B = 0$ if
$B = -nA^2/[2(nA-1)]$, which is $1/[2(\sqrt{n}+1)]$ for $A = 1/[\sqrt{n}(\sqrt{n}+1)]$. On
solving for $a$ and $b$, we obtain $a = b = \sqrt{n}/2$; so
$(\sum x_i + \sqrt{n}/2)/(n + \sqrt{n})$
is a Bayes estimator with constant risk and, hence, minimax. ////
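The constant-risk property can be verified directly, since the risk is a finite sum over the binomial distribution of $\sum X_i$. The following is a minimal sketch under that observation; the function name `risk` and the choice $n = 16$ are assumptions made for illustration. With $a = b = \sqrt{n}/2$ the risk should equal $B^2 = 1/[4(\sqrt{n}+1)^2]$ for every $\theta$.

```python
import math

def risk(n, theta, a, b):
    """Exact squared-error risk of (sum x_i + a)/(n + a + b) for Bernoulli(theta) sampling."""
    total = 0.0
    for s in range(n + 1):  # s = value of sum x_i, which is binomial(n, theta)
        pmf = math.comb(n, s) * theta ** s * (1 - theta) ** (n - s)
        total += pmf * ((s + a) / (n + a + b) - theta) ** 2
    return total

n = 16
a = b = math.sqrt(n) / 2
for theta in (0.1, 0.3, 0.5, 0.9):
    # minimax estimator risk (constant) versus sample-mean risk theta(1-theta)/n
    print(theta, risk(n, theta, a, b), risk(n, theta, 0.0, 0.0))
```

Setting $a = b = 0$ gives the sample mean, whose risk $\theta(1-\theta)/n$ is smaller near 0 and 1 but larger near $\theta = 1/2$.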


8 VECTOR OF PARAMETERS 

In this section we present a brief introduction to the problem of simultaneous
point estimation of several functions of a vector parameter. We will assume
that a random sample $X_1, \ldots, X_n$ of size $n$ from the density
$f(x; \theta_1, \ldots, \theta_k)$ is available, where the parameter
$\theta = (\theta_1, \ldots, \theta_k)$ and parameter space $\overline{\Theta}$ are
$k$-dimensional. We want to simultaneously estimate $\tau_1(\theta), \ldots, \tau_r(\theta)$, where
$\tau_j(\theta)$, $j = 1, \ldots, r$, is some function of $\theta = (\theta_1, \ldots, \theta_k)$. Often $k = r$, but this
need not be the case. An important special case is the estimation of
$\theta = (\theta_1, \ldots, \theta_k)$ itself; then $r = k$, and $\tau_1(\theta) = \theta_1, \ldots, \tau_k(\theta) = \theta_k$. Another important
special case is the estimation of $\tau(\theta)$; then $r = 1$. A point estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$ is a vector of statistics, say $(T_1, \ldots, T_r)$, where
$T_j = t_j(X_1, \ldots, X_n)$ and $T_j$ is an estimator of $\tau_j(\theta)$.

Our presentation of the method of moments and maximum-likelihood
method as techniques for finding estimators included the possibility that the
parameter be vector-valued. So we already have methods of determining estimators.
What we need are some criteria for assessing the goodness of an estimator,
say $(T_1, \ldots, T_r)$, and for comparing two estimators, say $(T_1, \ldots, T_r)$
and $(T_1', \ldots, T_r')$. As was the case in estimating a real-valued function $\tau(\theta)$,
where we wanted the values of our estimator to be close to $\tau(\theta)$, we now want
the values of the estimator $(T_1, \ldots, T_r)$ to be close to $(\tau_1(\theta), \ldots, \tau_r(\theta))$. We
want the distribution of $(T_1, \ldots, T_r)$ to be concentrated around $(\tau_1(\theta), \ldots, \tau_r(\theta))$.
There are a number of ways of measuring the closeness of an estimator. For
instance, in comparing two estimators the definitions of “more concentrated”
and “closer,” given in Subsec. 3.1, can be generalized to $r$ dimensions. We will,
however, restrict ourselves to consideration of unbiased estimators and define
several ways of measuring the closeness of an unbiased estimator. No attempt
will be made in this book to generalize to $r$ dimensions the notions of loss and/or
risk, invariance, Bayes estimation, and minimax. As far as optimum estimation
is concerned, we will be content to consider only unbiased estimators and
look for a best estimator within the restricted class of unbiased estimators.

Definition 33 Unbiased An estimator $(T_1, \ldots, T_r)$, where
$T_j = t_j(X_1, \ldots, X_n)$, $j = 1, \ldots, r$, is defined to be an unbiased estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$ if and only if $\mathcal{E}_\theta[T_j] = \tau_j(\theta)$ for $j = 1, \ldots, r$ and for all
$\theta \in \overline{\Theta}$. ////

In Sec. 5, where we considered unbiased estimation of a real-valued function
$\tau(\theta)$, we employed the variance of an estimator as a measure of its closeness
to $\tau(\theta)$. Here we seek a generalization of the notion of variance to $r$ dimensions.
Several such generalizations have been proposed; we will consider four of them,
called (i) vector of variances, (ii) linear combination (with nonnegative coefficients)
of variances, (iii) ellipsoid of concentration, and (iv) Wilks’ generalized
variance. The last two require some knowledge of matrices.

Possibly the simplest way of generalizing the concept of variance to
$r$ dimensions is to use the vector of variances of the unbiased estimators $T_1, \ldots, T_r$.
That is, let the vector $(\mathrm{var}_\theta[T_1], \ldots, \mathrm{var}_\theta[T_r])$ be a measure of the closeness of
the estimator $(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$. The disadvantage of such a
definition is that our measure is vector-valued and consequently sometimes
difficult to work with. One way of circumventing this disadvantage is to use a
linear combination of variances, that is, measure the closeness of the estimator
$(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$ with $\sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j]$ for suitably chosen $a_j \ge 0$.
Both of these generalizations of variance embody only the variances of the
$T_j$, $j = 1, \ldots, r$. The $T_j$ are likely to be correlated; so one might justifiably
think that our measure of closeness of $(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$ should
incorporate the covariances of the $T_j$'s.

Notation If $(T_1, \ldots, T_r)$ is an unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$,
let $\sigma_{ij}(\theta) = \mathrm{cov}_\theta[T_i, T_j]$. The matrix whose $ij$th element is $\sigma_{ij}(\theta)$ is called
the covariance matrix of the estimator $(T_1, \ldots, T_r)$. Let $\sigma^{ij}(\theta)$ denote the
$ij$th element of the inverse of the covariance matrix. ////
Definition 34 Ellipsoid of concentration Let $(T_1, \ldots, T_r)$ be an unbiased
estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Let $\sigma^{ij}(\theta)$ be the $ij$th element of the
inverse of the covariance matrix of $(T_1, \ldots, T_r)$, where the $ij$th element of
the covariance matrix is $\sigma_{ij}(\theta) = \mathrm{cov}_\theta[T_i, T_j]$. The ellipsoid of concentration
of $(T_1, \ldots, T_r)$ is defined as the interior and boundary of the ellipsoid

$$\sum_{i=1}^{r} \sum_{j=1}^{r} \sigma^{ij}(\theta)[t_i - \tau_i(\theta)][t_j - \tau_j(\theta)] = r + 2. \qquad (27)$$

See Fig. 8 for $r = 2$. ////

Loosely speaking, the ellipsoid of concentration measures how concentrated
the distribution of $(T_1, \ldots, T_r)$ is about $(\tau_1(\theta), \ldots, \tau_r(\theta))$. [In fact, if one
considers the vector random variable, say $(U_1, \ldots, U_r)$, uniformly distributed
over the ellipsoid of concentration, it can be proved that $(U_1, \ldots, U_r)$ and
$(T_1, \ldots, T_r)$ have the same first- and second-order moments.] The distribution of
an estimator $(T_1, \ldots, T_r)$ whose ellipsoid of concentration is contained within
the ellipsoid of concentration of another estimator $(T_1', \ldots, T_r')$ is more highly
concentrated about $(\tau_1(\theta), \ldots, \tau_r(\theta))$ than is the distribution of $(T_1', \ldots, T_r')$.

It is known that the determinant of the covariance matrix of an estimator
is proportional to the square of the volume of the corresponding ellipsoid of
concentration; hence another generalization of variance is as in Definition 35.

Definition 35 Wilks’ generalized variance Let $(T_1, \ldots, T_r)$ be an
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Wilks’ generalized variance
of $(T_1, \ldots, T_r)$ is defined to be the determinant of the covariance matrix
of $(T_1, \ldots, T_r)$. ////
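For $r = 2$ both of the matrix-based measures can be computed by hand. The sketch below assumes a known $2 \times 2$ covariance matrix and uses illustrative function names that do not come from the text. It computes Wilks' generalized variance as the determinant and tests whether a point lies in the ellipsoid of concentration of Eq. (27), where the right-hand side is $r + 2 = 4$.

```python
def generalized_variance(cov):
    """Determinant of a 2x2 covariance matrix ((s11, s12), (s21, s22))."""
    (s11, s12), (s21, s22) = cov
    return s11 * s22 - s12 * s21

def in_ellipsoid_of_concentration(point, center, cov):
    """True if sum_ij sigma^{ij} (t_i - tau_i)(t_j - tau_j) <= r + 2 with r = 2."""
    (s11, s12), (s21, s22) = cov
    det = generalized_variance(cov)
    # inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal, divide by det
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    d = (point[0] - center[0], point[1] - center[1])
    q = sum(inv[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
    return q <= 4.0

cov = ((2.0, 1.0), (1.0, 2.0))   # hypothetical covariance matrix of (T_1, T_2)
print(generalized_variance(cov))
print(in_ellipsoid_of_concentration((1.0, 1.0), (0.0, 0.0), cov))
```

A smaller determinant corresponds to a smaller ellipsoid volume, which is the sense in which the generalized variance measures concentration.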
Theorem 8, which showed how sufficiency could be used to improve on an
arbitrary unbiased estimator, generalizes to $r$ dimensions. The generalization
is stated without proof.

Theorem 15 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta_1, \ldots, \theta_k)$, and let $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_m = s_m(X_1, \ldots, X_n)$ be
a set of jointly sufficient statistics. Let $(T_1, \ldots, T_r)$ be an unbiased estimator
of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Define $T_j' = \mathcal{E}[T_j \mid S_1, \ldots, S_m]$, $j = 1, \ldots, r$.
Then,

(i) $(T_1', \ldots, T_r')$ is a statistic and an unbiased estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$, and $T_j' = t_j'(S_1, \ldots, S_m)$; that is, $T_j'$ is a function of the
sufficient statistics $S_1, \ldots, S_m$, $j = 1, \ldots, r$.

(ii) $\mathrm{var}_\theta[T_j'] \le \mathrm{var}_\theta[T_j]$ for every $\theta \in \overline{\Theta}$, $j = 1, \ldots, r$.

(iii) The ellipsoid of concentration of $(T_1', \ldots, T_r')$ is contained in
the ellipsoid of concentration of $(T_1, \ldots, T_r)$, for every $\theta \in \overline{\Theta}$. ////

We might note that (ii) implies that

$$\sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j'] \le \sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j] \quad \text{for } a_j \ge 0$$

and (iii) implies that Wilks’ generalized variance of $(T_1', \ldots, T_r')$ is no larger than
Wilks’ generalized variance of $(T_1, \ldots, T_r)$.

Theorem 10 of Sec. 5 can also be generalized to r dimensions, but first the 
concept of completeness has to be generalized. 

Definition 36 Joint completeness For $X_1, \ldots, X_n$, a random sample
from the density $f(x; \theta_1, \ldots, \theta_k)$, let $(T_1, \ldots, T_m)$ be a set of statistics.
$T_1, \ldots, T_m$ are defined to be jointly complete if and only if
$\mathcal{E}_\theta[z(T_1, \ldots, T_m)] \equiv 0$ for all $\theta \in \overline{\Theta}$ implies that $P_\theta[z(T_1, \ldots, T_m) = 0] = 1$
for all $\theta \in \overline{\Theta}$, where $z(T_1, \ldots, T_m)$ is a statistic. ////


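EXAMPLE 47 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta_1, \theta_2) = \frac{1}{\theta_2 - \theta_1} I_{(\theta_1, \theta_2)}(x),$$

where $\theta_1 < \theta_2$. Write $\theta = (\theta_1, \theta_2)$. Let $Y_1 = \min[X_1, \ldots, X_n]$ and
$Y_n = \max[X_1, \ldots, X_n]$. We want to show that $Y_1$ and $Y_n$ are jointly
complete. (We know that they are jointly sufficient.) Let $z(Y_1, Y_n)$ be
an unbiased estimator of 0, that is,

$$\mathcal{E}_\theta[z(Y_1, Y_n)] \equiv 0 \quad \text{for all } \theta \in \overline{\Theta}.$$

Now

$$
\begin{aligned}
\mathcal{E}_\theta[z(Y_1, Y_n)] &= \iint z(y_1, y_n) f_{Y_1, Y_n}(y_1, y_n)\, dy_1\, dy_n \\
&= \int_{\theta_1}^{\theta_2} \left[ \int_{\theta_1}^{y_n} z(y_1, y_n)\, n(n-1)
   \left( \frac{y_n - \theta_1}{\theta_2 - \theta_1} - \frac{y_1 - \theta_1}{\theta_2 - \theta_1} \right)^{n-2}
   \frac{1}{(\theta_2 - \theta_1)^2}\, dy_1 \right] dy_n,
\end{aligned}
$$

which is identically 0 if and only if

$$\int_{\theta_1}^{\theta_2} \int_{\theta_1}^{y_n} z(y_1, y_n)(y_n - y_1)^{n-2}\, dy_1\, dy_n \equiv 0 \quad \text{for } \theta_1 < \theta_2.$$

Differentiate both sides with respect to $\theta_2$, and obtain

$$\int_{\theta_1}^{\theta_2} z(y_1, \theta_2)(\theta_2 - y_1)^{n-2}\, dy_1 \equiv 0 \quad \text{for all } \theta_1 < \theta_2;$$

now differentiate both sides of the resulting identity with respect to
$\theta_1$, and obtain $-z(\theta_1, \theta_2)(\theta_2 - \theta_1)^{n-2} \equiv 0$ for all $\theta_1 < \theta_2$, and hence
$z(\theta_1, \theta_2) \equiv 0$ for $\theta_1 < \theta_2$; that is, $z(y_1, y_n) = 0$ for $y_1 < y_n$, where $y_1$
and $y_n$ are the possible values of $Y_1$ and $Y_n$. We have shown that $Y_1$ and
$Y_n$ are jointly complete. ////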


If the density $f(x; \theta_1, \ldots, \theta_k)$ is a member of the $k$-parameter exponential
family, a set of jointly complete and sufficient statistics can be found using the
following theorem. It is a $k$-dimensional analog of Theorem 9 and is stated
without proof. The following theorem is not precisely stated; certain regularity
conditions are omitted [16].

Theorem 16 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta_1, \ldots, \theta_k)$.
If

$$f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp\left[ \sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k)\, d_j(x) \right],$$

that is, $f(x; \theta_1, \ldots, \theta_k)$ is a member of the $k$-parameter exponential family, then

$$\left( \sum_{i=1}^{n} d_1(X_i), \ldots, \sum_{i=1}^{n} d_k(X_i) \right)$$

is a minimal set of jointly complete and sufficient statistics. ////
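EXAMPLE 48 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta_1, \theta_2) = \phi_{\theta_1, \theta_2^2}(x) = \frac{1}{\sqrt{2\pi}\,\theta_2} \exp\left[ -\frac{1}{2}\left( \frac{x - \theta_1}{\theta_2} \right)^2 \right].$$

Now

$$\phi_{\theta_1, \theta_2^2}(x) = \frac{1}{\sqrt{2\pi}\,\theta_2} \exp\left( -\frac{\theta_1^2}{2\theta_2^2} \right)
  \exp\left( -\frac{1}{2\theta_2^2}\, x^2 + \frac{\theta_1}{\theta_2^2}\, x \right);$$

so $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i^2$ are jointly complete and sufficient statistics by
Theorem 16. ////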


We will state without proof the vector analog of Theorem 10. In the 
same sense that an UMVUE was optimum, this following theorem gives an 
optimum estimator for a vector of functions of the parameter. 

Theorem 17 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta_1, \ldots, \theta_k)$.
Write $\theta = (\theta_1, \ldots, \theta_k)$. If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_m = s_m(X_1, \ldots, X_n)$
is a set of jointly complete sufficient statistics and if there exists an unbiased
estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$, then there exists a unique unbiased
estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$, say $T_1^* = t_1^*(S_1, \ldots, S_m), \ldots, T_r^* =
t_r^*(S_1, \ldots, S_m)$, where each $t_j^*$ is a function of $S_1, \ldots, S_m$, which satisfies:

(i) $\mathrm{var}_\theta[T_j^*] \le \mathrm{var}_\theta[T_j]$ for every $\theta \in \overline{\Theta}$, $j = 1, \ldots, r$, for any
unbiased estimator $(T_1, \ldots, T_r)$ of $(\tau_1(\theta), \ldots, \tau_r(\theta))$.

(ii) The ellipsoid of concentration of $(T_1^*, \ldots, T_r^*)$ is contained in
the ellipsoid of concentration of $(T_1, \ldots, T_r)$, where $(T_1, \ldots, T_r)$ is any
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. ////


There are four different maximal subscripts, all of which are intended.
$n$ denotes the sample size, $k$ denotes the dimension of the parameter $\theta$, $m$ is the
number of real-valued statistics in our jointly complete and sufficient set, and $r$
is the dimension of the vector of functions of the parameter that we are trying
to estimate. In practice, it will turn out that usually $k = m$. The estimator
$(T_1^*, \ldots, T_r^*)$ is optimal in the sense that among unbiased estimators it is the
best estimator using any of the four generalizations of variance that have been
proposed.

Just as was the case in using Theorem 10, we have two ways of finding
$(T_1^*, \ldots, T_r^*)$. The first is to guess the correct form of the functions $t_1^*, \ldots, t_r^*$,
which are functions of $S_1, \ldots, S_m$, that will make them unbiased estimators of
$\tau_1(\theta), \ldots, \tau_r(\theta)$. The second is to find any set of unbiased estimators of
$\tau_1(\theta), \ldots, \tau_r(\theta)$ and then calculate the conditional expectation of these unbiased
estimators given the set of jointly complete and sufficient statistics. We employ
only the first method in the following examples.


EXAMPLE 49 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta_1, \theta_2) = [1/(\theta_2 - \theta_1)] I_{(\theta_1, \theta_2)}(x)$. Suppose we want to jointly estimate
the range and midrange, that is, $\tau_1(\theta) = \theta_2 - \theta_1$ and $\tau_2(\theta) = (\theta_1 + \theta_2)/2$.
We know that $Y_1 = \min[X_1, \ldots, X_n]$ and $Y_n = \max[X_1, \ldots, X_n]$ are
jointly sufficient (see Example 23); also, they are jointly complete (see
Example 47). Hence, to find the unbiased estimator $(T_1^*, T_2^*)$ which has
uniformly smallest variance for each component among all unbiased
estimators, it suffices to find the unbiased estimator that is a function of
the jointly complete sufficient statistics. Since $\mathcal{E}[Y_1] = \theta_1 + (\theta_2 - \theta_1)/(n+1)$
and $\mathcal{E}[Y_n] = \theta_2 - (\theta_2 - \theta_1)/(n+1)$, $([(n+1)/(n-1)](Y_n - Y_1),\ (Y_1 + Y_n)/2)$
is the unbiased estimator of $(\theta_2 - \theta_1,\ (\theta_1 + \theta_2)/2)$ that we are seeking. ////
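Unbiasedness of the pair in Example 49 can be illustrated by simulation. The sketch below is only a Monte Carlo check, not a proof; the function names, the seed, and the parameter values $\theta_1 = 2$, $\theta_2 = 5$, $n = 5$ are assumptions chosen for illustration.

```python
import random

def estimate_range_and_midrange(xs):
    """Return ([(n+1)/(n-1)](y_n - y_1), (y_1 + y_n)/2) for a sample xs."""
    n = len(xs)
    y1, yn = min(xs), max(xs)
    return (n + 1) / (n - 1) * (yn - y1), (y1 + yn) / 2

def simulate(theta1, theta2, n, reps, seed=0):
    """Average the two estimators over `reps` uniform(theta1, theta2) samples of size n."""
    rng = random.Random(seed)
    sum_range = sum_mid = 0.0
    for _ in range(reps):
        xs = [rng.uniform(theta1, theta2) for _ in range(n)]
        r, m = estimate_range_and_midrange(xs)
        sum_range += r
        sum_mid += m
    return sum_range / reps, sum_mid / reps

# Averages should be close to the true range 3 and true midrange 3.5.
print(simulate(2.0, 5.0, n=5, reps=50_000))
```

Without the factor $(n+1)/(n-1)$ the sample range $Y_n - Y_1$ would systematically underestimate $\theta_2 - \theta_1$.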


EXAMPLE 50 Let $X_1, \ldots, X_n$ be a random sample from the normal density
$f(x; \theta_1, \theta_2) = \phi_{\mu, \sigma^2}(x)$. By Examples 22 and 48, $\sum X_i$ and $\sum X_i^2$ are
jointly complete and sufficient statistics. Hence, by Theorem 17,
$(\sum X_i/n,\ \sum (X_i - \bar{X})^2/(n-1))$ is an unbiased estimator of $(\mu, \sigma^2)$ whose
corresponding ellipsoid of concentration is contained in the ellipsoid of
concentration of any other unbiased estimator. [Note: $\sum (X_i - \bar{X})^2 =
\sum X_i^2 - n\bar{X}^2$; so the estimator $\sum (X_i - \bar{X})^2/(n-1)$ is a function of the
jointly complete and sufficient statistics $\sum X_i$ and $\sum X_i^2$.]

For this same example, suppose we want to estimate that function
of $\theta = (\mu, \sigma^2)$ satisfying the following integral equation:

$$\int_{\tau(\theta)}^{\infty} \phi_{\mu, \sigma^2}(x)\, dx = \alpha$$

for $\alpha$ fixed and known. $\tau(\theta)$ is that point which satisfies $P[X_i > \tau(\theta)] = \alpha$;
that is, it is that point which has $100\alpha$ percent of the mass of the population
density to its right, or $\tau(\theta)$ is the $(1 - \alpha)$th quantile point. We have
$1 - \alpha = \Phi([\tau(\theta) - \mu]/\sigma)$; so $\tau(\theta) = \mu + z_{1-\alpha}\sigma$, where $z_{1-\alpha}$ is given by
$\Phi(z_{1-\alpha}) = 1 - \alpha$. Since $\alpha$ is known, $z_{1-\alpha}$ can be obtained from a table
of the standard normal distribution. To find the UMVUE of $\tau(\theta)$, it
suffices to find the unbiased estimator of $\mu + z_{1-\alpha}\sigma$ which is a function of
$\sum X_i$ and $\sum X_i^2$. We know that $\bar{X}$ is the UMVUE of $\mu$, and it can be
verified that

$$\frac{\Gamma[(n-1)/2]}{\Gamma(n/2)\sqrt{2}} \sqrt{\sum (X_i - \bar{X})^2} = T^*,$$

say, is the UMVUE of $\sigma$; hence $\bar{X} + z_{1-\alpha} T^*$ is the UMVUE of $\tau(\theta)$. We
have employed Theorem 17 for $r = 1$; our vector of functions of the
parameter that we wanted to estimate was unidimensional. ////
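The estimator $\bar{X} + z_{1-\alpha} T^*$ is straightforward to compute. The following sketch uses only the Python standard library; the function names and the sample data are hypothetical, and `statistics.NormalDist().inv_cdf` is used in place of a table lookup for $z_{1-\alpha}$.

```python
import math
from statistics import NormalDist

def sigma_umvue(xs):
    """T* = Gamma[(n-1)/2] / (Gamma(n/2) * sqrt(2)) * sqrt(sum (x_i - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    c = math.gamma((n - 1) / 2) / (math.gamma(n / 2) * math.sqrt(2))
    return c * math.sqrt(ss)

def upper_quantile_umvue(xs, alpha):
    """xbar + z_{1-alpha} * T*, estimating the point with mass alpha to its right."""
    z = NormalDist().inv_cdf(1 - alpha)   # requires 0 < alpha < 1
    xbar = sum(xs) / len(xs)
    return xbar + z * sigma_umvue(xs)

sample = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2]   # hypothetical measurements
print(sigma_umvue(sample), upper_quantile_umvue(sample, alpha=0.05))
```

For $n = 2$ the constant reduces to $\sqrt{\pi/2}$, which gives a quick sanity check of the gamma-function factor.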


9 OPTIMUM PROPERTIES OF MAXIMUM-LIKELIHOOD ESTIMATION

Several methods of finding point estimators were presented in Sec. 2 of this 
chapter. There, and in succeeding sections, we have particularly emphasized 
the method of maximum likelihood. In this section we will partially justify 
such emphasis by considering some optimum properties of maximum-likelihood 
estimators. 

For simplicity of presentation, let us consider the maximum-likelihood
estimation of the parameter $\theta$, which is to be estimated on the basis of a random
sample from a density $f(\cdot\,; \theta)$, where $\theta$ is assumed to be a real number. That is,
let us consider the unidimensional-parameter case and estimate $\theta$ itself. Recall
that for the observed sample $x_1, \ldots, x_n$ the maximum-likelihood estimate of $\theta$
is that value, say $\hat{\theta}$, of $\theta$ which maximizes the likelihood function
$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)$. Let $\hat{\Theta}_n = \hat{\vartheta}_n(X_1, \ldots, X_n)$ denote the
maximum-likelihood estimator of $\theta$ based on a sample of size $n$. We defined and
discussed in Sec. 3 of this chapter a number of properties that an estimator may
or may not possess.
Recall that some of these properties, such as unbiasedness and uniformly
minimum variance, are referred to as small-sample properties, and others of
these properties, such as consistency and best asymptotically normal, are referred
to as large-sample properties. The use of the word “small” in “small-sample”
is somewhat misleading since a small-sample property is really a property that is
defined for a fixed sample size, which may be fixed to be either small or large.
By a large-sample property, we mean a property that is defined in terms of the
sample size increasing to infinity. Our main result of this section will be contained
in Theorem 18 below and will concern optimum large-sample properties
of maximum-likelihood estimation.
We have already observed some small-sample properties of maximum-likelihood
estimation. For instance, we have noted two things: first, that some
maximum-likelihood estimators are unbiased and others are not and, second,
that some maximum-likelihood estimators are uniformly minimum-variance
unbiased and others are not. For example, in the density $f(x; \theta) = \phi_{\theta, 1}(x)$ the
maximum-likelihood estimator of $\theta$ is $\bar{X}$, which is the uniformly minimum-variance
unbiased estimator of $\theta$, whereas in the density $f(x; \theta) = (1/\theta) I_{[0, \theta]}(x)$
the maximum-likelihood estimator of $\theta$ is $Y_n = \max[X_1, \ldots, X_n]$, which is
biased. [We might note here that the $Y_n$ in this last example can be corrected
for bias by multiplying $Y_n$ by $(n+1)/n$ and that the estimator that is thus
obtained is uniformly minimum variance unbiased.]

One property that it seems reasonable to expect of a sequence of estimators
is that of consistency. Theorem 18 will show, in particular, that generally
a sequence of maximum-likelihood estimators is consistent.


Theorem 18 If the density $f(x; \theta)$ satisfies certain regularity conditions
and if $\hat{\Theta}_n = \hat{\vartheta}_n(X_1, \ldots, X_n)$ is the maximum-likelihood estimator of $\theta$ for
a random sample of size $n$ from $f(x; \theta)$, then:

(i) $\hat{\Theta}_n$ is asymptotically normally distributed with mean $\theta$ and
variance

$$\frac{1}{n\, \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]}.$$

(ii) The sequence of maximum-likelihood estimators $\hat{\Theta}_1, \ldots, \hat{\Theta}_n, \ldots$
is best asymptotically normal (BAN). ////


We will not be able to prove Theorem 18. In fact, we have not precisely 
stated it, inasmuch as we have not delineated the regularity conditions. We 
do, however, want to emphasize what the theorem says. Loosely speaking, it 
says that for large sample size the maximum-likelihood estimator of $\theta$ is as
good an estimator as there is. (Other estimators might be just as good but not 
better.) 

We might point out one feature of the theorem, namely, that the asymptotic 
normal distribution of the maximum-likelihood estimator is not given in terms of 
the distribution of the maximum-likelihood estimator. It is given in terms of 
$f(\cdot\,; \theta)$, the density sampled. Also, the variance of the asymptotic normal
distribution given in the theorem is the Cramér-Rao lower bound.

EXAMPLE 51 Let $X_1, \ldots, X_n$ be a random sample from the negative
exponential distribution $f(x; \theta) = \theta e^{-\theta x} I_{[0, \infty)}(x)$. It can be routinely
demonstrated that the maximum-likelihood estimator of $\theta$ is given by
$n / \sum_{i=1}^{n} X_i = 1/\bar{X}_n$. According to Theorem 18 above, the maximum-likelihood
estimator has an asymptotic normal distribution with mean $\theta$ and
variance equal to

$$\frac{1}{n\, \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]} = \frac{\theta^2}{n}.$$

(See Example 29.) ////
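The asymptotic variance $\theta^2/n$ can be illustrated by simulation. The sketch below is a rough Monte Carlo check under assumed values ($\theta = 2$, $n = 200$, a fixed seed); the function name is introduced here for illustration. It draws repeated samples, computes $1/\bar{X}_n$ for each, and compares the empirical variance with $\theta^2/n$.

```python
import random
from statistics import mean, variance

def simulate_mle(theta, n, reps, seed=0):
    """Return the empirical mean and variance of the MLE n / sum(X_i) over `reps` samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        total = sum(rng.expovariate(theta) for _ in range(n))  # expovariate takes the rate theta
        estimates.append(n / total)
    return mean(estimates), variance(estimates)

theta, n = 2.0, 200
m, v = simulate_mle(theta, n, reps=2000)
print("mean of MLE:", m, "(theta =", theta, ")")
print("n * var / theta^2:", n * v / theta ** 2, "(should be near 1)")
```

For finite $n$ the estimator is slightly biased upward and its variance slightly exceeds $\theta^2/n$, so the ratio printed is expected to be close to, not exactly, 1.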

We have ordinarily considered estimation of $\tau(\theta)$, some function of $\theta$,
rather than estimation of $\theta$ itself. For maximum-likelihood estimation, we
noted (see Theorem 2) that the maximum-likelihood estimator of $\tau(\theta)$ was given
by $\tau(\hat{\Theta})$, where $\hat{\Theta}$ was the maximum-likelihood estimator of $\theta$. If we assume
that $\tau(\cdot)$ is differentiable, then it can be shown that $\tau(\hat{\Theta})$ has an asymptotic
normal distribution with mean $\tau(\theta)$ and variance

$$\frac{[\tau'(\theta)]^2}{n\, \mathcal{E}_\theta\left[ \left[ \frac{\partial}{\partial\theta} \log f(X; \theta) \right]^2 \right]},$$

which is the Cramér-Rao lower bound. (See Theorem 7.)

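Maximum-likelihood estimators possess similar optimum large-sample
properties in the case of a $k$-dimensional parameter. For instance, it can be
proved (again under regularity conditions) that the joint distribution of the
maximum-likelihood estimators is asymptotically distributed as a multivariate
normal distribution. Let us illustrate for the case when $k = 2$; that is,
$\theta = (\theta_1, \theta_2)$. Recall that the bivariate normal distribution is specified by the
five parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$. (See Sec. 5 of Chap. IV.) It turns out
that under certain regularity conditions the joint distribution of the maximum-likelihood
estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ is asymptotically distributed as a bivariate
normal distribution with parameters $\mu_1 = \theta_1$, $\mu_2 = \theta_2$,

$$\sigma_1^2 = \frac{-\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta) \right]}{n\Delta},
\qquad
\sigma_2^2 = \frac{-\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta) \right]}{n\Delta},$$

and

$$\rho\sigma_1\sigma_2 = \frac{\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1 \partial\theta_2} \log f(X; \theta) \right]}{n\Delta},$$

where

$$\Delta = \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta) \right]
         \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta) \right]
       - \left( \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2 \partial\theta_1} \log f(X; \theta) \right] \right)^2.$$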


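EXAMPLE 52 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = f(x; \theta_1, \theta_2) = \phi_{\theta_1, \theta_2}(x) = \frac{1}{\sqrt{2\pi\theta_2}} \exp\left[ -\frac{(x - \theta_1)^2}{2\theta_2} \right].$$

We have already derived, in Example 6, the maximum-likelihood estimators
of $\theta_1$ and $\theta_2$; they are, respectively,

$$\hat{\Theta}_1 = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\Theta}_2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\Theta}_1)^2.$$

According to the above, the asymptotic large-sample joint distribution of
$\hat{\Theta}_1$ and $\hat{\Theta}_2$ is a bivariate normal distribution with means $\theta_1$ and $\theta_2$.
Since $\log f(X; \theta) = -\tfrac{1}{2} \log 2\pi - \tfrac{1}{2} \log \theta_2 - (1/2\theta_2)(X - \theta_1)^2$, the required
derivatives are

$$\frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta) = -\frac{1}{\theta_2},
\qquad
\frac{\partial^2}{\partial\theta_2 \partial\theta_1} \log f(X; \theta) = -\frac{X - \theta_1}{\theta_2^2},$$

and

$$\frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta) = \frac{1}{2\theta_2^2} - \frac{(X - \theta_1)^2}{\theta_2^3},$$

and because

$$\mathcal{E}[X] = \theta_1 \quad \text{and} \quad \mathcal{E}[(X - \theta_1)^2] = \theta_2,$$

their expectations are $-1/\theta_2$, $0$, and $-1/(2\theta_2^2)$, respectively,
which gives $\Delta = 1/(2\theta_2^3)$. Finally, then, $\sigma_1^2 = \theta_2/n$, $\sigma_2^2 = 2\theta_2^2/n$, and $\rho = 0$.

////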
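The arithmetic of Example 52 can be traced in a few lines. The sketch below plugs the three expected second derivatives of $\log f$ for the $N(\theta_1, \theta_2)$ density into the bivariate formulas for $\sigma_1^2$, $\sigma_2^2$, and $\rho\sigma_1\sigma_2$; the function name and the values $\theta_2 = 4$, $n = 100$ are illustrative assumptions.

```python
def asymptotic_cov_normal(theta2, n):
    """Asymptotic (var of Theta1-hat, var of Theta2-hat, covariance) for N(theta1, theta2) sampling."""
    e11 = -1.0 / theta2                  # E[d^2/dtheta1^2 log f]
    e22 = -1.0 / (2.0 * theta2 ** 2)     # E[d^2/dtheta2^2 log f]
    e12 = 0.0                            # E[d^2/dtheta1 dtheta2 log f]
    delta = e11 * e22 - e12 ** 2         # equals 1 / (2 * theta2**3)
    var1 = -e22 / (n * delta)            # reduces to theta2 / n
    var2 = -e11 / (n * delta)            # reduces to 2 * theta2**2 / n
    cov12 = e12 / (n * delta)            # rho * sigma1 * sigma2, zero here
    return var1, var2, cov12

print(asymptotic_cov_normal(theta2=4.0, n=100))
```

Because the mixed derivative has expectation zero, the two estimators are asymptotically uncorrelated, which for a bivariate normal limit means asymptotically independent.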


PROBLEMS 

1 An urn contains black and white balls. A sample of size $n$ is drawn with replacement.
What is the maximum-likelihood estimator of the ratio $R$ of black to
white balls in the urn? Suppose that one draws balls one by one with replacement
until a black ball appears. Let $X$ be the number of draws required (not counting
the last draw). This operation is repeated $n$ times to obtain a sample $X_1, X_2, \ldots,
X_n$. What is the maximum-likelihood estimator of $R$ on the basis of this sample?

2 Suppose that $n$ cylindrical shafts made by a machine are selected at random from
the production of the machine and their diameters and lengths measured. It is
found that $N_{11}$ have both measurements within the tolerance limits, $N_{12}$ have
satisfactory lengths but unsatisfactory diameters, $N_{21}$ have satisfactory diameters
but unsatisfactory lengths, and $N_{22}$ are unsatisfactory as to both measurements.
$\sum N_{ij} = n$. Each shaft may be regarded as a drawing from a multinomial population
with density

$$p_{11}^{x_{11}} p_{12}^{x_{12}} p_{21}^{x_{21}} (1 - p_{11} - p_{12} - p_{21})^{x_{22}} \quad \text{for } x_{ij} = 0, 1;\ \sum x_{ij} = 1$$

having three parameters. What are the maximum-likelihood estimates of the
parameters if $N_{11} = 90$, $N_{12} = 6$, $N_{21} = 3$, and $N_{22} = 1$?

3 Referring to Prob. 2, suppose that there is no reason to believe that defective
diameters can in any way be related to defective lengths. Then the distribution
of the $X_{ij}$ can be set up in terms of two parameters: $p_1$, the probability of a satisfactory
length, and $q_1$, the probability of a satisfactory diameter. The density of
the $X_{ij}$ is then

$$(p_1 q_1)^{x_{11}} [p_1(1 - q_1)]^{x_{12}} [(1 - p_1) q_1]^{x_{21}} [(1 - p_1)(1 - q_1)]^{x_{22}}
\quad \text{for } x_{ij} = 0, 1;\ \sum x_{ij} = 1.$$

What are the maximum-likelihood estimates for these parameters? Are the probabilities
for the four classes different under this model from those obtained in the
above problem?

4 A sample of size $n_1$ is to be drawn from a normal population with mean $\mu_1$ and
variance $\sigma_1^2$. A second sample of size $n_2$ is to be drawn from a normal population
with mean $\mu_2$ and variance $\sigma_2^2$. What is the maximum-likelihood estimator of
$\theta = \mu_1 - \mu_2$? If we assume that the total sample size $n = n_1 + n_2$ is fixed, how
should the $n$ observations be divided between the two populations in order to
minimize the variance of the maximum-likelihood estimator of $\theta$?
5 A sample of size $n$ is drawn from each of four normal populations, all of which
have the same variance $\sigma^2$. The means of the four populations are $a + b + c$,
$a + b - c$, $a - b + c$, and $a - b - c$. What are the maximum-likelihood estimators
of $a$, $b$, $c$, and $\sigma^2$? (The sample observations may be denoted by $X_{ij}$, $i = 1,
2, 3, 4$ and $j = 1, 2, \ldots, n$.)

6 Observations $X_1, X_2, \ldots, X_n$ are drawn from normal populations with the same
mean $\mu$ but with different variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$. Is it possible to estimate all
the parameters? If we assume that the $\sigma_i^2$ are known, what is the maximum-likelihood
estimator of $\mu$?

7 The radius of a circle is measured with an error of measurement which is distributed
$N(0, \sigma^2)$, $\sigma^2$ unknown. Given $n$ independent measurements of the
radius, find an unbiased estimator of the area of the circle.

8 Let $X$ be a single observation from the Bernoulli density $f(x; \theta) =
\theta^x (1 - \theta)^{1-x} I_{\{0,1\}}(x)$, where $0 < \theta < 1$. Let $t_1(X) = X$ and $t_2(X) = \tfrac{1}{2}$.

(a) Are both $t_1(X)$ and $t_2(X)$ unbiased? Is either?

(b) Compare the mean-squared error of $t_1(X)$ with that of $t_2(X)$.

9 Let $X_1, X_2$ be a random sample of size 2 from the Cauchy density

$$f(x; \theta) = \frac{1}{\pi[1 + (x - \theta)^2]}, \qquad -\infty < \theta < \infty.$$

Argue that $(X_1 + X_2)/2$ is a Pitman closer estimator of $\theta$ than $X_1$ is. [Note that
$(X_1 + X_2)/2$ is not more concentrated than $X_1$ since they have identical distributions.]

10 Let $\theta$ denote some physical quantity, and let $X_1, \ldots, X_n$ denote $n$ measurements of
the physical quantity. If $\theta$ is estimated by $\hat{\theta}$, then the residual of the $i$th measurement
is defined by $X_i - \hat{\theta}$, $i = 1, \ldots, n$. Show that there is only one estimator
with the property that the residuals sum to 0, and find that estimator. Also,
find that estimator which minimizes the sum of squared residuals.

11 Let $X_1, \ldots, X_n$ be a random sample from some density which has mean $\mu$ and
variance $\sigma^2$.

(a) Show that $\sum_{1}^{n} a_i X_i$ is an unbiased estimator of $\mu$ for any set of known constants
$a_1, \ldots, a_n$ satisfying $\sum_{1}^{n} a_i = 1$.

(b) If $\sum_{1}^{n} a_i = 1$, show that $\mathrm{var}[\sum_{1}^{n} a_i X_i]$ is minimized for $a_i = 1/n$, $i = 1, \ldots, n$.
[Hint: Prove that $\sum_{1}^{n} a_i^2 = \sum_{1}^{n} (a_i - 1/n)^2 + 1/n$ when $\sum_{1}^{n} a_i = 1$.]

12 Let $X_1, \ldots, X_n$ be a random sample from the discrete density function $f(x; \theta) =
\theta^x (1 - \theta)^{1-x} I_{\{0,1\}}(x)$, where $0 < \theta < \tfrac{1}{2}$. Note that $\overline{\Theta} = \{\theta: 0 < \theta < \tfrac{1}{2}\}$.

(a) Find a method-of-moments estimator of $\theta$, and then find the mean and
mean-squared error of your estimator.

(b) Find a maximum-likelihood estimator of $\theta$, and then find the mean and
mean-squared error of your estimator.
13 Let $X_1, X_2$ be a random sample of size 2 from a normal distribution with mean $\theta$
and variance 1. Consider the following three estimators of $\theta$:

T\ = X 2 ) — ^ X i •- \X 2 

T 2 = t 2 (X u X 2 ) = iX, + iX 2 
T 2 = /jfA'i, X 2 ) = zX\ ■ | $X 2 - 

(a) For the loss function $\ell(t; \theta) = 3\theta^2 (t - \theta)^2$, find the risk $\mathcal{R}_{t_i}(\theta)$ for $i = 1, 2, 3$, and
sketch it.

(b) Show that $T_i$ is unbiased for $i = 1, 2, 3$.

14 Let $X_1, \ldots, X_n, \ldots$ be independent and identically distributed random variables
from some distribution for which the first four central moments exist. We know
that $\mathscr{E}[S^2] = \sigma^2$ and

$$\mathrm{var}[S^2] = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right),$$

where

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Is $S^2$ a mean-squared-error consistent estimator of $\sigma^2$?

15 In genetic investigations one frequently samples from a binomial distribution

$$f(x) = \binom{m}{x} p^x (1 - p)^{m-x}$$

except that observations of $x = 0$ are impossible; so, in fact, the
sampling is from the conditional (truncated) distribution

$$\binom{m}{x} \frac{p^x (1 - p)^{m-x}}{1 - (1 - p)^m}\, I_{\{1, 2, \ldots, m\}}(x).$$

Find the maximum-likelihood estimator of $p$ in the case $m = 2$ for samples of
size $n$. Is the estimator unbiased?

16 Let $X$ be a single observation from $N(0, \theta)$. ($\theta = \sigma^2$.)

(a) Is $X$ a sufficient statistic?

(b) Is $|X|$ a sufficient statistic?

(c) Is $X^2$ an unbiased estimator of $\theta$?

(d) What is a maximum-likelihood estimator of $\sqrt{\theta}$?

(e) What is a method-of-moments estimator of $\sqrt{\theta}$?

17 Let $X$ have the density $f(x; \theta) = (\theta/2)^{|x|} (1 - \theta)^{1-|x|} I_{\{-1, 0, 1\}}(x)$, $0 < \theta < 1$.
Define $t(x) = 2 I_{\{1\}}(x)$.

(a) Is $X$ a sufficient statistic? A complete statistic?

(b) Is $|X|$ a sufficient statistic? A complete statistic?

(c) What is a maximum-likelihood estimator of $\theta$?

(d) Is $T = t(X)$ an unbiased estimator of $\theta$?

(e) Does $f(x; \theta)$ belong to an exponential class?

(f) Find an estimator with uniformly smaller mean-squared error than that of
$t(X)$, if such exists.
18 Let $X_1, X_2, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \theta x^{-2} I_{(\theta, \infty)}(x),$$

where $\theta > 0$.

(a) Find a maximum-likelihood estimator of $\theta$.

(b) Is $Y_1 = \min[X_1, \ldots, X_n]$ a sufficient statistic?

19 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \frac{1}{2} e^{-|x - \theta|}$, $-\infty < \theta < \infty$.

(a) Discuss sufficiency for this density.

(b) Obtain a method-of-moments estimator of $\theta$.

(c) Find a maximum-likelihood estimator of $\theta$.

(d) Does $f(x; \theta)$ belong to an exponential class?

20 Find a maximum-likelihood estimator for $\alpha$ in the density $f(x; \alpha) =
(2/\alpha^2)(\alpha - x) I_{(0, \alpha)}(x)$ for samples of size 2. Is it a sufficient statistic? Estimate $\alpha$
by the method of moments. What is the maximum-likelihood estimator of the
population mean?

21 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = (1/\theta) I_{[0, \theta]}(x)$, where $\theta > 0$.
Define $Y_n = \max[X_1, \ldots, X_n]$ and $Y_1 = \min[X_1, \ldots, X_n]$.

(a) Estimate $\theta$ by the method of moments. Call the estimator $T_1$. Find its
mean and mean-squared error.

(b) Find the maximum-likelihood estimator of $\theta$. Call the estimator $T_2$. Find
its mean and mean-squared error.

(c) Among all estimators of the form $aY_n$, where $a$ is a constant which may depend
on $n$, find that estimator which has uniformly smallest mean-squared error.
Call it $T_3$. Find its mean and mean-squared error.

(d) Find the UMVUE of $\theta$. Call it $T_4$. Obtain its mean and mean-squared
error.

(e) Let $T_5 = Y_1 + Y_n$. Find the mean and mean-squared error of $T_5$.

(f) What estimator of $\theta$ would you use and why?

(g) Find the maximum-likelihood estimator of the variance of the population.

22 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli distribution, say $P[X = 1] =
\theta = 1 - P[X = 0]$.

(a) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$\theta(1 - \theta)$.

(b) Find the UMVUE of $\theta(1 - \theta)$ if such exists.

23 Assuming $r$ known, find the maximum-likelihood estimator for $\lambda$ for a random
sample of size $n$ from a gamma distribution. Find a sufficient statistic if one
exists. Is your maximum-likelihood estimator unbiased? Is there an UMVUE
of $\lambda$?

24 Let $X_1, \ldots, X_n$ be a random sample from $\theta x^{\theta - 1} I_{(0, 1)}(x)$, where $\theta > 0$.

(a) Find the maximum-likelihood estimator of $\mu = \theta/(1 + \theta)$.

(b) Find a sufficient statistic, and check completeness. Is $\sum X_i$ a sufficient
statistic?

(c) Is there a function of $\theta$ for which there exists an unbiased estimator whose
variance coincides with the Cramér-Rao lower bound?

*(d) Find the UMVUE of each of the following: $\theta$, $1/\theta$, $\mu = \theta/(1 + \theta)$.
25 Let $X_1, \ldots, X_n$ be a random sample from the binomial distribution

$$\binom{m}{x} p^x (1 - p)^{m-x}, \qquad x = 0, 1, \ldots, m,$$

where $m$ is known and $0 < p < 1$.

(a) Estimate $p$ by the method of moments and the method of maximum likelihood.

(b) Is there an UMVUE of $p$? If so, find it.

*26 Let $X_1, \ldots, X_n$ be a random sample from the discrete density function

$$f(x; \theta) = (1/\theta) I_{\{1, 2, \ldots, \theta\}}(x),$$

where $\theta = 1, 2, \ldots$. That is, $\Theta = \{\theta: \theta = 1, 2, \ldots\}$ = the set of positive integers.

(a) Find a method-of-moments estimator of $\theta$. Find its mean and mean-squared
error.

(b) Find a maximum-likelihood estimator of $\theta$. Find its mean and mean-squared
error.

(c) Find a complete sufficient statistic.

(d) Let $T = Y_n$, the largest order statistic. Show that the UMVUE of $\theta$ is
$[T^{n+1} - (T - 1)^{n+1}]/[T^n - (T - 1)^n]$.

27 Let $X$ be a single observation from the density $[1/B(\theta, \theta)]\, x^{\theta - 1} (1 - x)^{\theta - 1} I_{(0, 1)}(x)$.
Is $X$ a sufficient statistic? Is $X$ complete?

28 An experimenter knows that the distribution of the lifetime of a certain component 
is negative exponentially distributed with mean $1/\theta$. On the basis of a random
sample of size n of lifetimes he wants to estimate the median lifetime. Find both 
the maximum-likelihood and uniformly minimum-variance unbiased estimator of 
the median. 

29 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, 1)$.

(a) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$\theta$, $\theta^2$, and $P[X > 0]$.

(b) Is there an unbiased estimator of $\theta^2$ for $n = 1$? If so, find it.

(c) Is there an unbiased estimator of $P[X > 0]$? If so, find it.

(d) What is the maximum-likelihood estimator of $P[X > 0]$?

(e) Is there an UMVUE of $\theta^2$? If so, find it.

(f) Is there an UMVUE of $P[X > 0]$? If so, find it.

30 For a random sample from the Poisson distribution, find an unbiased estimator of
$\tau(\lambda) = (1 + \lambda) e^{-\lambda}$. Find a maximum-likelihood estimator of $\tau(\lambda)$. Find the
UMVUE of $\tau(\lambda)$.

31 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \frac{2x}{\theta^2}\, I_{(0, \theta)}(x),$$

where $\theta > 0$.

(a) Find a maximum-likelihood estimator of $\theta$.

(b) Is $Y_n = \max[X_1, \ldots, X_n]$ a sufficient statistic? Is $Y_n$ complete?

(c) Is there an UMVUE of $\theta$? If so, find it.
32 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \theta (1 + x)^{-(1 + \theta)} I_{(0, \infty)}(x) \qquad \text{for } \theta > 0.$$

(a) Estimate $\theta$ by the method of moments assuming $\theta > 1$.

(b) Find the maximum-likelihood estimator of $1/\theta$.

(c) Find a complete and sufficient statistic if one exists.

(d) Find the Cramér-Rao lower bound for unbiased estimators of $1/\theta$.

(e) Find the UMVUE of $1/\theta$ if such exists.

(f) Find the UMVUE of $\theta$ if such exists.

33 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta) = \frac{1}{2\theta}\, I_{[-\theta, \theta]}(x) \qquad \text{for } \theta > 0.$$

(a) Find a maximum-likelihood estimator of $\theta$.

(b) Suppose $n = 1$, so that you have only one observation, say $X = X_1$. Clearly
$X$ is a sufficient statistic. Is $X$ a minimal sufficient statistic? Is $X$ complete?

34 Let $X_1, \ldots, X_n$ be a random sample from the negative exponential density

$$f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x).$$

(a) Find the uniformly minimum-variance unbiased estimator of $\mathrm{var}[X_1]$ if such
exists.

(b) Find an unbiased estimator of $1/\theta$ based only on $Y_1 = \min[X_1, \ldots, X_n]$.
Is your sequence of estimators mean-squared-error consistent?

35 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \frac{\log \theta}{\theta - 1}\, \theta^x I_{(0, 1)}(x), \qquad \theta > 1.$$

(a) Find a complete sufficient statistic if there is one.

(b) Find a function of $\theta$ for which there exists an unbiased estimator whose
variance coincides with the Cramér-Rao lower bound if such exists.

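*36 Show that

$$\mathscr{E}_\theta\left[\left\{\frac{\partial}{\partial\theta} \log f(X; \theta)\right\}^2\right] = -\mathscr{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X; \theta)\right].$$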

37 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = e^{-(x - \theta)} \exp\left(-e^{-(x - \theta)}\right),$$

where $-\infty < \theta < \infty$.

(a) Find a method-of-moments estimator of $\theta$.

(b) Find a maximum-likelihood estimator of $\theta$.

(c) Find a complete sufficient statistic.

(d) Find the Cramér-Rao lower bound for unbiased estimators of $\theta$.

(e) Is there a function of $\theta$ for which there exists an unbiased estimator, the
variance of which coincides with the Cramér-Rao lower bound? If so, find it.

*(f) Show that $\Gamma'(n)/\Gamma(n) - \log\left(\sum e^{-X_i}\right)$ is the UMVUE of $\theta$.
38 Let $X_1, \ldots, X_n$ denote a random sample from

$$f(x; \theta) = f_\theta(x) = \theta f_1(x) + (1 - \theta) f_0(x),$$

where $0 < \theta < 1$ and $f_1(\cdot)$ and $f_0(\cdot)$ are known densities.

(a) Estimate $\theta$ by the method of moments.

(b) For $n = 2$, find a maximum-likelihood estimator of $\theta$.

(c) Find the Cramér-Rao lower bound for the variance of unbiased estimators
of $\theta$.

39 Suppose that $a(\cdot)$ and $b(\cdot)$ are two nonnegative functions such that $f(x; \theta) =
a(\theta) b(x) I_{(0, \theta)}(x)$ is a probability density function for each $\theta > 0$.

(a) What is a maximum-likelihood estimator of $\theta$?

(b) Is there a complete sufficient statistic? If so, find it.

(c) Is there an UMVUE of $\theta$? If so, find it.

40 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \theta)$, $\theta > 0$.

(a) Find a complete sufficient statistic if such exists.

(b) Argue that $\bar{X}$ is not an UMVUE of $\theta$.

(c) Is $\theta$ either a location or scale parameter?

41 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \theta^2)$, $-\infty < \theta < \infty$.

(a) Is there a unidimensional sufficient statistic?

(b) Find a two-dimensional sufficient statistic.

(c) Is $\bar{X}$ an UMVUE of $\theta$? {Hint: Find an unbiased estimator of $\theta$ based on $S^2$;
call it $T^*$. Find a constant $a$ to minimize $\mathrm{var}[a\bar{X} + (1 - a)T^*]$.}

(d) Is $\theta$ either a location or scale parameter?

42 Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the density

$$f(x; \theta) = \frac{1}{\theta}\, I_{[\theta, 2\theta]}(x), \qquad \theta > 0.$$

(a) Find a maximum-likelihood estimator of $\theta$.

(b) We know that $Y_1$ and $Y_n$ are jointly sufficient. Are they jointly complete?

(c) Find the Pitman estimator for the scale parameter $\theta$.

(d) For $a$ and $b$ constant (they may depend on $n$), find an unbiased estimator of
$\theta$ of the form $aY_1 + bY_n$ satisfying $P[Y_n/2 \le aY_1 + bY_n \le Y_1] = 1$ if such
exists. Why is $P[Y_n/2 \le aY_1 + bY_n \le Y_1] = 1$ desirable?

43 Let $Z_1, \ldots, Z_n$ be a random sample from $N(0, \theta^2)$, $\theta > 0$. Define $X_i = |Z_i|$, and
consider estimation of $\theta$ and $\theta^2$ on the basis of the random sample $X_1, \ldots, X_n$.

(a) Find the UMVUE of $\theta^2$ if such exists.

(b) Find an estimator of $\theta^2$ that has uniformly smaller mean-squared error than
the estimator that you found in part (a).

(c) Find the UMVUE of $\theta$ if such exists.

(d) Find the Pitman estimator for the scale parameter $\theta$.

(e) Does the estimator that you found in part (d) have uniformly smaller
mean-squared error than the estimator that you found in part (c)?
44 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta) = e^{-(x - \theta)} I_{(\theta, \infty)}(x) \qquad \text{for } -\infty < \theta < \infty.$$

(a) Find a sufficient statistic.

(b) Find a maximum-likelihood estimator of $\theta$.

(c) Find a method-of-moments estimator of $\theta$.

(d) Is there a complete sufficient statistic? If so, find it.

(e) Find the UMVUE of $\theta$ if one exists.

(f) Find the Pitman estimator for the location parameter $\theta$.

(g) Using the prior density $g(\theta) = e^{-\theta} I_{(0, \infty)}(\theta)$, find the posterior Bayes estimator
of $\theta$.

45 Let $X_1, \ldots, X_n$ be a random sample from $f(x \mid \theta) =$ where $\theta > 0$.

Assume that the prior distribution of $\Theta$ is given by

$$g_\Theta(\theta) = [1/\Gamma(r)]\, \lambda^r \theta^{r-1} e^{-\lambda\theta} I_{(0, \infty)}(\theta),$$

where $r$ and $\lambda$ are known.

(a) What is the posterior distribution of $\Theta$?

(b) Find the Bayes estimator of $\theta$ with respect to the given gamma prior
distribution using a squared-error loss function.

46 Let $X$ be a single observation from the density $f(x \mid \theta) = (2x/\theta^2) I_{(0, \theta)}(x)$, where
$\theta > 0$. Assume that $\Theta$ has a uniform prior distribution over the interval (0, 1).
For the loss function $\ell(t; \theta) = \theta^2 (t - \theta)^2$, find the Bayes estimator of $\theta$.

47 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the following discrete
density:

$$f(x; \theta) = \binom{2}{x} \theta^x (1 - \theta)^{2-x} I_{\{0, 1, 2\}}(x),$$

where $\theta > 0$.

(a) Is there a unidimensional sufficient statistic? If so, is it complete?

(b) Find a maximum-likelihood estimator of $\theta^2 = P[X_i = 2]$. Is it unbiased?

(c) Find an unbiased estimator of $\theta$ whose variance coincides with the corresponding
Cramér-Rao lower bound if such exists. If such an estimate does not
exist, prove that it does not.

(d) Find a uniformly minimum-variance unbiased estimator of $\theta^2$ if such exists.

(e) Using the squared-error loss function find a Bayes estimator of $\theta$ with respect
to the beta prior distribution

(f) Using the squared-error loss function, find a minimax estimator of $\theta$.

(g) Find a mean-squared-error consistent estimator of $\theta^2$.
48 Let $X_1, \ldots, X_n$ be a random sample from a Poisson density

$$\frac{e^{-\theta} \theta^x}{x!}\, I_{\{0, 1, \ldots\}}(x),$$

where $\theta > 0$. For a squared-error loss function find the Bayes estimator of $\theta$ for
a gamma prior distribution. Find the posterior distribution of $\Theta$. Find the
posterior Bayes estimator of $\tau(\theta) = P[X_i = 0]$.

49 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = (1/\theta) I_{(0, \theta)}(x)$, where $\theta > 0$.

For the loss function $(t - \theta)^2/\theta^2$ and a prior distribution proportional to
8~ */<i, * )(8) find the Bayes estimator of 8. 

50 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli distribution. Using the
squared-error loss function, find that estimator of $\theta$ which has minimum area
under its risk function.

51 Let $X_1, \ldots, X_n$ be a random sample from the geometric density

$$f(x; \theta) = \theta (1 - \theta)^x I_{\{0, 1, \ldots\}}(x),$$

where $0 < \theta < 1$.

(a) Find a method-of-moments estimator of $\theta$.

(b) Find a maximum-likelihood estimator of $\theta$.

(c) Find a maximum-likelihood estimator of the mean.

(d) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$1 - \theta$.

(e) Is there a function of $\theta$ for which there exists an unbiased estimator the
variance of which coincides with the Cramér-Rao lower bound? If so,
find it.

(f) Find the UMVUE of $(1 - \theta)/\theta$ if such exists.

(g) Find the UMVUE of $\theta$ if such exists.

(h) Assume a uniform prior distribution and find the posterior distribution of $\Theta$.
For a squared-error loss function, find the Bayes estimator of $\theta$ with respect
to a uniform prior distribution.

52 Let $\theta$ be the true I.Q. of a certain student. To measure his I.Q., the student takes
a test, and it is known that his test scores are normally distributed with mean $\theta$ and
standard deviation 5.

(a) The student takes the I.Q. test and gets a score of 130. What is the
maximum-likelihood estimate of $\theta$?

(b) Suppose that it is known that I.Q.'s of students of a certain age are
distributed normally with mean 100 and variance 225; that is, $\Theta \sim N(100, 225)$.
Let $X$ denote a student's test score [$X$ is distributed $N(\theta, 25)$]. Find the
posterior distribution of $\Theta$ given $X = x$. What is the posterior Bayes estimate
of the student's I.Q. if $X = 130$?

*53 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \alpha, \beta) = \beta^{-1} e^{-(x - \alpha)/\beta} I_{[\alpha, \infty)}(x),$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. Show that $Y_1$ and $\sum X_i$ are jointly sufficient.
It can be shown that $Y_1$ and $\sum (X_i - Y_1)$ are jointly complete and independent of
each other. Using such results, find the estimator of $(\alpha, \beta)$ that has an ellipsoid of
concentration that is contained in the ellipsoid of concentration of any other
unbiased estimator of $(\alpha, \beta)$. ($Y_1 = \min[X_1, \ldots, X_n]$.)

54 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \alpha, \theta) = (1 - \theta)\, \theta^{x - \alpha} I_{\{\alpha, \alpha + 1, \ldots\}}(x),$$

where $-\infty < \alpha < \infty$ and $0 < \theta < 1$.

(a) Find a two-dimensional set of sufficient statistics.

*(b) Find the maximum-likelihood estimator of $(\alpha, \theta)$.

55 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \frac{2x}{\theta}\, I_{[0, \theta]}(x) + \frac{2(1 - x)}{1 - \theta}\, I_{(\theta, 1]}(x),$$

where $0 < \theta < 1$.

(a) Estimate $\theta$ by the method of moments.

(b) Find the maximum-likelihood estimator of $\theta$ for $n = 1$ and $n = 2$.

(c) For $n = 1$ find a complete sufficient statistic if such exists. Find a UMVUE
of $\theta$ for $n = 1$ if such exists.

*(d) Find the maximum-likelihood estimator of $\theta$.
1 INTRODUCTION AND SUMMARY

Chapter VII dealt with the point estimation of a parameter, or more precisely, 
point estimation of a value of a function of a parameter. Such point estimates 
are quite useful, yet they leave something to be desired. In all those cases when 
the point estimator under consideration had a probability density function, the 
probability that the estimator actually equaled the value of the parameter being 
estimated was 0. (The probability that a continuous random variable equals 
any one value is 0.) Hence, it seems desirable that a point estimate should be 
accompanied by some measure of the possible error of the estimate. For
instance, a point estimate might be accompanied by some interval about the point
estimate together with some measure of assurance that the true value of the 
parameter lies within the interval. Instead of making the inference of estimating 
the true value of the parameter to be a point, we might make the inference of 
estimating that the true value of the parameter is contained in some interval. 
We then speak of interval estimation, which is to be the subject of this chapter. 

Like point estimation, the problem of interval estimation is twofold. First, 
there is the problem of finding interval estimators, and, second, there is the
problem of determining good, or optimum, interval estimators. The considerations
of these two problems that will appear in this chapter will be incomplete. 
Further considerations will be presented at the end of the next chapter on testing 
hypotheses. The mathematics of interval estimation and hypotheses testing are 
closely related. Either concept could be used to introduce the other. In this 
book, we have decided to introduce interval estimation first, right after our 
presentation of point estimation, then introduce hypotheses testing, and finally 
point out the close mathematical relationship between the two. 

The introduction to interval estimation that appears in this chapter will 
not be as thorough as was our discussion of point estimation in the last chapter. 
One should not infer from this that interval estimation is less important since in 
practice the opposite is usually true. It is just easier to present the basic theory 
of point estimation. No concerted effort will be given to the problem of finding 
optimum interval estimators. The chapter will be divided into six main sections, 
the first being this introductory section. Section 2 will be devoted to confidence 
intervals, where the notion is introduced and defined. One method of finding 
confidence intervals will also be given as well as some idea as to what an optimum 
confidence interval might be. Section 3 will consider several examples of
confidence intervals that are associated with sampling from the normal distribution.
Such discussion will hinge on the results of Sec. 4 of Chap. VI. Several general 
methods of finding confidence intervals are given in Sec. 4; another method, 
which utilizes the theory of hypotheses testing, will be given at the end of Chap. 
IX. A brief discussion of large-sample confidence intervals appears in Sec. 5, 
and Sec. 6 presents another type of interval estimation, namely, Bayesian interval 
estimation. 


2 CONFIDENCE INTERVALS 

2.1 An Introduction to Confidence Intervals 

In practice, estimates are often given in the form of the estimate plus or minus a
certain amount. For instance, an electric charge may be estimated to be
$(4.770 \pm .005) \times 10^{-10}$ electrostatic unit with the idea that the first factor is very
unlikely to be outside the range 4.765 to 4.775. A cost accountant for a publishing
company in trying to allow for all factors which enter into the cost of producing
a certain book (actual production costs, proportion of plant overhead,
proportion of executive salaries, etc.) may estimate the cost to be 83 ± 4.5 cents per
volume with the implication that the correct cost very probably lies between
78.5 and 87.5 cents per volume. The Bureau of Labor Statistics may estimate
the number of unemployed in a certain area to be 2.4 ± .3 million at a given
time, feeling rather sure that the actual number is between 2.1 and 2.7 million.
What we are saying is that in practice one is quite accustomed to seeing estimates
in the form of intervals.

In order to give precision to these ideas, we shall consider a particular
example. Suppose that a random sample (1.2, 3.4, .6, 5.6) of four observations
is drawn from a normal population with an unknown mean $\mu$ and a known
standard deviation 3. The maximum-likelihood estimate of $\mu$ is the mean of the
sample observations:

$$\bar{x} = 2.7.$$

We wish to determine upper and lower limits which are rather certain to contain
the true unknown parameter value between them.

In general, for samples of size 4 from the given distribution the quantity

$$Z = \frac{\bar{X} - \mu}{3/2}$$

will be normally distributed with mean 0 and unit variance. $\bar{X}$ is the sample
mean, and $3/2$ is $\sigma/\sqrt{n}$. Thus the quantity $Z$ has a density

$$f_Z(z) = \phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2},$$

which is independent of the true value of the unknown parameter; so we can
compute the probability that $Z$ will be between any two arbitrarily chosen
numbers. Thus, for example,


$$P[-1.96 < Z < 1.96] = \int_{-1.96}^{1.96} \phi(z)\, dz = .95. \tag{1}$$

In this relation the inequality $-1.96 < Z$, or

$$-1.96 < \frac{\bar{X} - \mu}{3/2},$$

is equivalent to the inequality

$$\mu < \bar{X} + \tfrac{3}{2}(1.96) = \bar{X} + 2.94,$$

and the inequality

$$Z < 1.96$$

is equivalent to

$$\mu > \bar{X} - 2.94.$$

We may therefore rewrite Eq. (1) in the form

$$P[\bar{X} - 2.94 < \mu < \bar{X} + 2.94] = .95, \tag{2}$$

and substituting 2.7 for $\bar{X}$ we obtain the interval

$$(-.24,\ 5.64).$$

It is at this point that a certain abuse of language takes place since the
random interval $(\bar{X} - 2.94, \bar{X} + 2.94)$ and the interval $(-.24, 5.64)$ are each
called a confidence interval, or more precisely a 95 percent confidence interval.
[The interval $(-.24, 5.64)$ is the value of the random interval $(\bar{X} - 2.94, \bar{X} + 2.94)$
when $\bar{X} = 2.7$.] The meaning of Eq. (2) is the following: The probability that
the random interval $(\bar{X} - 2.94, \bar{X} + 2.94)$ covers the unknown true mean $\mu$ is .95.
That is, if samples of size 4 were repeatedly drawn from the normal population
and if the random interval $(\bar{X} - 2.94, \bar{X} + 2.94)$ were computed for each sample,
then the relative frequency of those intervals that contain the true unknown mean
$\mu$ would approach 95 percent. We therefore have considerable confidence that
the observed interval, here $(-.24, 5.64)$, covers the true mean. The measure of
our confidence is .95 because before the sample was drawn .95 was the
probability that the interval that we were going to construct would cover the true
mean. .95 is called the confidence coefficient.
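
The relative-frequency reading of Eq. (2) can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the function names `interval` and `coverage`, the true mean 10.0 used in the check, the number of trials, and the seed are all hypothetical choices, not part of the text. It recomputes the observed interval for the sample (1.2, 3.4, .6, 5.6) and then estimates how often the random interval covers a fixed true mean.

```python
import random
from statistics import NormalDist

SIGMA = 3.0                        # known standard deviation from the example
N = 4                              # sample size from the example
Z = NormalDist().inv_cdf(0.975)    # upper 2.5 percent point, about 1.96


def interval(sample, sigma=SIGMA, z=Z):
    """Return (x_bar - z*sigma/sqrt(n), x_bar + z*sigma/sqrt(n))."""
    n = len(sample)
    x_bar = sum(sample) / n
    half_width = z * sigma / n ** 0.5
    return x_bar - half_width, x_bar + half_width


# Observed interval for the sample in the text: roughly (-.24, 5.64).
lower, upper = interval([1.2, 3.4, 0.6, 5.6])


def coverage(true_mu, trials=20000, seed=0):
    """Fraction of simulated intervals that cover true_mu (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, SIGMA) for _ in range(N)]
        lo, hi = interval(sample)
        hits += lo < true_mu < hi
    return hits / trials
```

Whatever value is supplied for `true_mu`, the fraction returned by `coverage` should settle near .95, which is the sense in which .95 measures our confidence before the sample is drawn.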

Similarly, intervals with any desired degree of confidence between 0 and 1
can be obtained. Thus, since

$$P[-2.58 < Z < 2.58] = .99,$$

a 99 percent confidence interval for the true mean is obtained by converting the
inequalities as before to get

$$P[\bar{X} - 3.87 < \mu < \bar{X} + 3.87] = .99$$

and then substituting 2.7 for $\bar{X}$ to get the interval $(-1.17, 6.57)$.

It is to be observed that there are, in fact, many possible intervals with the
same probability (with the same confidence coefficient). Thus, for example,
since

$$P[-1.68 < Z < 2.70] = .95,$$

another 95 percent confidence interval for $\mu$ is given by the interval $(-1.35, 5.22)$.
This interval is inferior to the one obtained before because its length 6.57 is
greater than the length 5.88 of the interval $(-.24, 5.64)$; it gives less precise
information about the location of $\mu$. Any two numbers $a$ and $b$ such that
95 percent of the area under $\phi(z)$ lies between $a$ and $b$ will determine a 95 percent
confidence interval. Ordinarily one would want the confidence interval to be
as short as possible, and it is made so by making $a$ and $b$ as close together as
possible because the relation $P[a < Z < b] = .95$ gives rise to a confidence
interval of length $(\sigma/\sqrt{n})(b - a)$. The distance $b - a$ will be minimized for a
fixed area when $\phi(a) = \phi(b)$, as is evident on referring to Fig. 1. If the point $b$
is moved a short distance to the left, the point $a$ will need to be moved a lesser
distance to the left in order to keep the area the same; this operation decreases
the length of the interval and will continue to do so as long as $\phi(b) < \phi(a)$.
Since $\phi(z)$ is symmetrical about $z = 0$ in the present example, the minimum value
of $b - a$ for a fixed area occurs when $b = -a$. Thus for $\bar{x} = 2.7$, $(-.24, 5.64)$
gives the shortest 95 percent confidence interval, and $(-1.17, 6.57)$ gives the
shortest 99 percent confidence interval for $\mu$.

FIGURE 1
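
The claim that the symmetric choice $b = -a$ gives the shortest interval can be checked numerically. The following is a minimal sketch, assuming the example's values $\sigma = 3$ and $n = 4$; the helper names `upper_point` and `length` and the grid of trial values for $a$ are hypothetical. For each $a$ it finds the $b$ with $P[a < Z < b] = .95$ and reports the resulting interval length $(\sigma/\sqrt{n})(b - a)$.

```python
from statistics import NormalDist

STD = NormalDist()   # standard normal distribution


def upper_point(a, gamma=0.95):
    """Return b with P[a < Z < b] = gamma; requires Phi(a) < 1 - gamma."""
    return STD.inv_cdf(STD.cdf(a) + gamma)


def length(a, sigma=3.0, n=4, gamma=0.95):
    """Length (sigma/sqrt(n)) * (b - a) of the interval determined by a."""
    b = upper_point(a, gamma)
    return sigma / n ** 0.5 * (b - a)


symmetric = length(-STD.inv_cdf(0.975))   # a = -1.96, b = 1.96
lopsided = length(-1.68)                  # a = -1.68, b close to 2.70

# Lengths over a grid of admissible values of a, from -3.00 to -1.70.
grid_lengths = [length(-3.0 + 0.01 * k) for k in range(131)]
```

Here `symmetric` is about 5.88 and `lopsided` is about 6.56 (6.57 in the text, which rounds $b$ to 2.70), and no value on the grid falls below the symmetric length.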

In most problems it is not possible to construct confidence intervals which
are shortest for a given confidence coefficient. In these cases one may wish to
find a confidence interval which has the shortest expected length or is such that
the probability that the confidence interval covers a value $\mu^*$ is minimized, where
$\mu^* \ne \mu$.

The method of finding a confidence interval that has been illustrated in the 
example above is a general method. The method entails finding, if possible, a 
function (the quantity Z above) of the sample and the parameter to be estimated 
which has a distribution independent of the parameter and any other parameters. 
Then any probability statement of the form $P[a < Z < b] = \gamma$ for known $a$ and
b, where Z is the function, will give rise to a probability statement about the 
parameter that we hope can be rewritten to give a confidence interval. This 
method, or technique, is fully described in Subsec. 2.3 below. This technique 
is applicable in many important problems, but in others it is not because in these 
others it is either impossible to find functions of the desired form or it is
impossible to rewrite the derived probability statements. These latter problems can
be dealt with by a more general technique to be described in Sec. 4. 

The idea of interval estimation can be extended to include simultaneous
estimation of several parameters. Thus the two parameters of the normal
distribution may be estimated by some plane region $R$ in the so-called parameter
space, that is, the space of all possible combinations of values of $\mu$ and $\sigma^2$. A
95 percent confidence region is a region constructible from the sample such that
if samples were repeatedly drawn and a region constructed for each sample,
95 percent of those regions in a long-term relative-frequency sense would include
the true parameter point $(\mu_0, \sigma_0^2)$ (see Fig. 2).

FIGURE 2

Confidence intervals and regions provide good illustrations of uncertain 
inferences. In Eq. (2) the inference is made that the interval — .24 to 5.64 covers 
the true parameter value, but that statement is not made categorically. A 
measure, .05, of the uncertainty of the inference is an essential part of the 
statement. 

2.2 Definition of Confidence Interval 

In the previous subsection we attempted to give some feel for the concept of a 
confidence interval by discussing a simple example. In this subsection we define, 
in general, what a confidence interval is and in the next subsection describe one 
method of finding confidence intervals. 

We assume that we have a random sample $X_1, \ldots, X_n$ from a density
$f(\cdot\,; \theta)$ parameterized by $\theta$. Previously, in Chap. VII, we considered point
estimates of, say, $\tau(\theta)$, some real function of $\theta$. Now we look for an interval
estimate of $\tau(\theta)$.

Definition 1 Confidence interval Let $X_1, \ldots, X_n$ be a random sample
from the density $f(\cdot\,; \theta)$. Let $T_1 = t_1(X_1, \ldots, X_n)$ and $T_2 = t_2(X_1, \ldots, X_n)$
be two statistics satisfying $T_1 < T_2$ for which $P_\theta[T_1 < \tau(\theta) < T_2] = \gamma$, where
$\gamma$ does not depend on $\theta$; then the random interval $(T_1, T_2)$ is called a $100\gamma$
percent confidence interval for $\tau(\theta)$; $\gamma$ is called the confidence coefficient; and
$T_1$ and $T_2$ are called the lower and upper confidence limits, respectively,
for $\tau(\theta)$. A value $(t_1, t_2)$ of the random interval $(T_1, T_2)$ is also called a
$100\gamma$ percent confidence interval for $\tau(\theta)$. ////
We note that one or the other, but not both, of the two statistics
$t_1(X_1, \ldots, X_n)$ and $t_2(X_1, \ldots, X_n)$ may be constant; that is, one of the two end
points of the random interval $(T_1, T_2)$ may be constant.


Definition 2 One-sided confidence interval Let $X_1, \ldots, X_n$ be a random
sample from the density $f(\cdot\,; \theta)$. Let $T_1 = t_1(X_1, \ldots, X_n)$ be a statistic
for which $P_\theta[T_1 < \tau(\theta)] = \gamma$; then $T_1$ is called a one-sided lower confidence
limit for $\tau(\theta)$. Similarly, let $T_2 = t_2(X_1, \ldots, X_n)$ be a statistic for which
$P_\theta[\tau(\theta) < T_2] = \gamma$; then $T_2$ is called a one-sided upper confidence limit for
$\tau(\theta)$. ($\gamma$ does not depend on $\theta$.) ////


EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta, 9}(x)$.
Set $T_1 = t_1(X_1, \ldots, X_n) = \bar{X} - 6/\sqrt{n}$ and $T_2 = t_2(X_1, \ldots, X_n) = \bar{X} +
6/\sqrt{n}$; then $(T_1, T_2)$ constitutes a random interval and is a confidence
interval for $\tau(\theta) = \theta$ with confidence coefficient $\gamma = P_\theta[\bar{X} - 6/\sqrt{n} < \theta < \bar{X} +
6/\sqrt{n}] = P_\theta[-2 < (\bar{X} - \theta)/(3/\sqrt{n}) < 2] = \Phi(2) - \Phi(-2) = .9772 - .0228 =
.9544$. Also, if a random sample of 25 observations has a sample mean
of, say, 17.5, then the interval $(17.5 - 6/\sqrt{25}, 17.5 + 6/\sqrt{25})$ is also
called a 95.44 percent confidence interval for $\theta$. ////
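
The arithmetic of Example 1 can be reproduced in a few lines. This is only a sketch; the variable names are hypothetical, and the standard normal distribution function is taken from Python's `statistics.NormalDist` rather than from a table, so the confidence coefficient comes out as .9545 instead of the table-rounded .9544.

```python
from statistics import NormalDist

std = NormalDist()                       # standard normal distribution

# Confidence coefficient: Phi(2) - Phi(-2).
gamma = std.cdf(2) - std.cdf(-2)

# Observed interval for n = 25 and a sample mean of 17.5.
n, x_bar = 25, 17.5
half_width = 6 / n ** 0.5                # 6/sqrt(n) = 2 * (3/sqrt(n))
observed = (x_bar - half_width, x_bar + half_width)
```

With these values `half_width` is 1.2, so the observed interval runs from about 16.3 to 18.7.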


Remark If a confidence interval for $\theta$ has been determined, then, in
essence, a whole family of confidence intervals has been determined.
More specifically, for a given $100\gamma$ percent confidence interval estimator of
$\theta$ a $100\gamma$ percent confidence-interval estimator of $\tau(\theta)$ can be obtained,
where $\tau(\cdot)$ is any strictly monotone function. For example, if $\tau(\cdot)$
is a monotone, increasing function and $(T_1 = t_1(X_1, \ldots, X_n), T_2 =
t_2(X_1, \ldots, X_n))$ is a $100\gamma$ percent confidence interval for $\theta$, then
$(\tau(T_1), \tau(T_2))$ is a $100\gamma$ percent confidence interval for $\tau(\theta)$ since

$$P_\theta[\tau(T_1) < \tau(\theta) < \tau(T_2)] = P_\theta[T_1 < \theta < T_2] = \gamma. \qquad ////$$
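
The remark rests on the fact that, for an increasing $\tau(\cdot)$, the event $\{T_1 < \theta < T_2\}$ and the event $\{\tau(T_1) < \tau(\theta) < \tau(T_2)\}$ are the same event. The sketch below checks this sample by sample. Everything specific in it is an assumption made for illustration: the choice $\tau(\theta) = e^\theta$, the normal population with known standard deviation 3, the true value $\theta = 1$, the sample size, the seed, and the helper names `theta_interval` and `agreement`.

```python
import math
import random
from statistics import NormalDist


def theta_interval(sample, sigma=3.0, gamma=0.95):
    """Interval x_bar -/+ z*sigma/sqrt(n) for the mean, sigma assumed known."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    n = len(sample)
    x_bar = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def agreement(theta=1.0, n=10, trials=2000, seed=1, tau=math.exp):
    """Fraction of samples on which the two coverage events coincide."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        sample = [rng.gauss(theta, 3.0) for _ in range(n)]
        t1, t2 = theta_interval(sample)
        covers_theta = t1 < theta < t2
        covers_tau = tau(t1) < tau(theta) < tau(t2)
        same += covers_theta == covers_tau
    return same / trials
```

Since the two events coincide on every sample, `agreement()` should return 1.0, and the transformed interval inherits the confidence coefficient of the original one.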


As was the case in point estimation, our problem is twofold: First, we need 
methods of finding a confidence interval, and, second, we need criteria for
comparing competing confidence intervals or for assessing the goodness of a 
confidence interval. In the next subsection, we will describe one method of 
finding confidence intervals and call it the pivotal-quantity method. 
FIGURE 3



2.3 Pivotal Quantity 

As before, we assume a random sample $X_1, \ldots, X_n$ from some density $f(\cdot\,; \theta)$
parameterized by $\theta$. Our object is to find a confidence-interval estimate of $\tau(\theta)$,
a real-valued function of $\theta$. $\theta$ itself may be vector-valued.

Definition 3 Pivotal quantity Let $X_1, \ldots, X_n$ be a random sample from
the density $f(\cdot\,; \theta)$. Let $Q = q(X_1, \ldots, X_n; \theta)$; that is, let $Q$ be a function
of $X_1, \ldots, X_n$ and $\theta$. If $Q$ has a distribution that does not depend on $\theta$,
then $Q$ is defined to be a pivotal quantity. ////


EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta, 9}(x)$.
$\bar{X} - \theta$ is a pivotal quantity since $\bar{X} - \theta$ is normally distributed with mean 0
and variance $9/n$. Also $(\bar{X} - \theta)/(3/\sqrt{n})$ has a standard normal distribution
and, hence, is a pivotal quantity. On the other hand, $\bar{X}/\theta$ is not a pivotal
quantity since $\bar{X}/\theta$ is normally distributed with mean unity and variance
$9/(\theta^2 n)$, which depends on $\theta$. ////
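
Example 2 can be illustrated by simulation. The sketch below is a rough check under assumed settings: the sample size $n = 9$, the two trial values of $\theta$, the number of trials, the seed, and the function name `simulate` are all hypothetical. It estimates the variance of $\bar{X} - \theta$ and of $\bar{X}/\theta$ for a given $\theta$.

```python
import random
from statistics import pvariance


def simulate(theta, n=9, trials=20000, seed=0):
    """Return the sample variances of X_bar - theta and X_bar / theta."""
    rng = random.Random(seed)
    diffs, ratios = [], []
    for _ in range(trials):
        # Sample mean of n observations from N(theta, 9).
        x_bar = sum(rng.gauss(theta, 3.0) for _ in range(n)) / n
        diffs.append(x_bar - theta)
        ratios.append(x_bar / theta)
    return pvariance(diffs), pvariance(ratios)


var_diff_1, var_ratio_1 = simulate(1.0)
var_diff_3, var_ratio_3 = simulate(3.0)
```

With $n = 9$ the variance of $\bar{X} - \theta$ should be near $9/n = 1$ for both values of $\theta$, whereas the variance of $\bar{X}/\theta$ should be near $9/(\theta^2 n)$, that is, near 1 for $\theta = 1$ and near $1/9$ for $\theta = 3$.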


Our hope is to utilize a pivotal quantity to obtain a confidence interval. 

Pivotal-quantity method If $Q = q(X_1, \ldots, X_n; \theta)$ is a pivotal quantity and
has a probability density function, then for any fixed $0 < \gamma < 1$ there will exist $q_1$
and $q_2$ depending on $\gamma$ such that $P[q_1 < Q < q_2] = \gamma$. Now, if for each possible
sample value $(x_1, \ldots, x_n)$, $q_1 < q(x_1, \ldots, x_n; \theta) < q_2$ if and only if $t_1(x_1, \ldots, x_n) <
\tau(\theta) < t_2(x_1, \ldots, x_n)$ for functions $t_1$ and $t_2$ (not depending on $\theta$), then $(T_1, T_2)$
is a $100\gamma$ percent confidence interval for $\tau(\theta)$, where $T_i = t_i(X_1, \ldots, X_n)$, $i = 1, 2$.

Before illustrating the pivotal-quantity method with a simple example we 
make several comments. First, $q_1$ and $q_2$ are independent of $\theta$ since the 
distribution of $Q$ is. Second, for any fixed $\gamma$ there are many possible pairs of numbers 
$q_1$ and $q_2$ that can be selected so that $P[q_1 < Q < q_2] = \gamma$. See Fig. 3. Different 
pairs of $q_1$ and $q_2$ will produce different $t_1$ and $t_2$. We should want to select 
that pair of $q_1$ and $q_2$ that will make $t_1$ and $t_2$ close together in some sense. For 
instance, if $t_2(X_1, \ldots, X_n) - t_1(X_1, \ldots, X_n)$, which is the length of the confidence 
interval, is not random, then we might select that pair of $q_1$ and $q_2$ that makes the 
length of the interval smallest; or if the length of the confidence interval is 
random, then we might select that pair of $q_1$ and $q_2$ that makes the average 
length of the interval smallest. 

As a third and final comment, note that the essential feature of the 
pivotal-quantity method is that the inequality $\{q_1 < q(x_1, \ldots, x_n; \theta) < q_2\}$ can be 
rewritten, or inverted or “pivoted,” as $\{t_1(x_1, \ldots, x_n) < \tau(\theta) < t_2(x_1, \ldots, x_n)\}$ for 
any possible sample value $x_1, \ldots, x_n$. [This last comment indicates that 
“pivotal quantity” may be a misnomer since according to our definition 
$Q = q(X_1, \ldots, X_n; \theta)$ may be a pivotal quantity, yet it may be impossible to “pivot” 
it.] 


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from $\phi_{\theta, 1}(x)$. Consider 
estimating $\tau(\theta) = \theta$. $Q = q(X_1, \ldots, X_n; \theta) = (\bar{X} - \theta)/\sqrt{1/n}$ has a 
standard normal distribution and, hence, is a pivotal quantity. $f_Q(q) = \phi(q)$. 
For given $\gamma$ there exist $q_1$ and $q_2$ such that $P[q_1 < Q < q_2] = \gamma$ (in 
fact, there exist many such $q_1$ and $q_2$). See Fig. 4. 

Now $\{q_1 < (\bar{x} - \theta)/\sqrt{1/n} < q_2\}$ if and only if 
$\{\bar{x} - q_2\sqrt{1/n} < \theta < \bar{x} - q_1\sqrt{1/n}\}$; so 
$(\bar{X} - q_2\sqrt{1/n},\ \bar{X} - q_1\sqrt{1/n})$ is a $100\gamma$ percent confidence 
interval for $\theta$. The length of the confidence interval is given by 
$(\bar{X} - q_1\sqrt{1/n}) - (\bar{X} - q_2\sqrt{1/n}) = (q_2 - q_1)\sqrt{1/n}$; so the length will be 
made smallest by selecting $q_1$ and $q_2$ so that $q_2 - q_1$ is a minimum under the 
restriction that $\gamma = P[q_1 < Q < q_2] = \Phi(q_2) - \Phi(q_1)$, and $q_2 - q_1$ will be a 
minimum if $q_1 = -q_2$, as can be seen from Fig. 4. //// 
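
As a numerical illustration of Example 3, the following sketch inverts the pivotal quantity for two choices of the pair (q1, q2) with the same coverage and compares the interval lengths. The observed mean of 0, the sample size n = 25, the coefficient .95, and the helper name `normal_mean_interval` are assumptions made for the demonstration only.

```python
from math import sqrt
from statistics import NormalDist

def normal_mean_interval(xbar, n, q1, q2):
    """Invert q1 < (xbar - theta)/sqrt(1/n) < q2 into an interval for theta."""
    se = sqrt(1.0 / n)
    return xbar - q2 * se, xbar - q1 * se

std = NormalDist()   # standard normal distribution
gamma = 0.95

# Symmetric choice q1 = -q2, so that Phi(q2) - Phi(q1) = gamma.
q2_sym = std.inv_cdf((1 + gamma) / 2)
sym = normal_mean_interval(xbar=0.0, n=25, q1=-q2_sym, q2=q2_sym)

# An asymmetric pair with the same coverage: .01 below q1 and .04 above q2.
q1_asym, q2_asym = std.inv_cdf(0.01), std.inv_cdf(0.96)
asym = normal_mean_interval(xbar=0.0, n=25, q1=q1_asym, q2=q2_asym)

sym_length = sym[1] - sym[0]
asym_length = asym[1] - asym[0]
print(sym_length, asym_length)   # the symmetric interval is the shorter one
```

Both intervals have confidence coefficient .95; only the placement of the two tail probabilities differs.
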


The steps in the pivotal-quantity method of finding a confidence interval 
are two: First, find a pivotal quantity, and, second, invert it. We will comment 
further on techniques for finding pivotal quantities in Sec. 4. The method is 
thoroughly exploited in the next section. 
3 SAMPLING FROM THE NORMAL DISTRIBUTION 

Let $X_1, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$ 
and variance $\sigma^2$. The first three subsections of this section are generated by 
the cases (i) confidence interval for $\mu$ only, (ii) confidence interval for $\sigma^2$ only, 
and (iii) simultaneous confidence interval for $\mu$ and $\sigma$. The fourth subsection 
considers a confidence interval for the difference between two means. 


3.1 Confidence Interval for the Mean 

There are really two cases to consider depending on whether or not $\sigma^2$ is known. 
We leave the case $\sigma^2$ known as an exercise. (The technique is given in Example 3.) 
We want a confidence interval for $\mu$ when $\sigma^2$ is unknown. In our general 
discussion in Sec. 2 above our parameter was denoted by $\theta$. Here $\theta = (\mu, \sigma)$, and 
$\tau(\theta) = \mu$. We need a pivotal quantity. $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard normal 
distribution; so it is a pivotal quantity, but $\{q_1 < (\bar{x} - \mu)/(\sigma/\sqrt{n}) < q_2\}$ cannot be 
inverted to give $\{t_1(x_1, \ldots, x_n) < \mu < t_2(x_1, \ldots, x_n)\}$ for any statistics $t_1$ and $t_2$. 
The problem with $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ seems to be the presence of $\sigma$. We look for a 
pivotal quantity involving only $\mu$. We know that 

$$\frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\sum (X_i - \bar{X})^2/(n - 1)\sigma^2}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$ 

has a t distribution with $n - 1$ degrees of freedom. [Recall that 
$S^2 = \sum (X_i - \bar{X})^2/(n - 1)$.] So $(\bar{X} - \mu)/(S/\sqrt{n})$ has a density that is independent of 
$\mu$ and $\sigma^2$; hence it is a pivotal quantity. Now one has $\{q_1 < (\bar{x} - \mu)/(s/\sqrt{n}) < q_2\}$ 
if and only if $\{\bar{x} - q_2(s/\sqrt{n}) < \mu < \bar{x} - q_1(s/\sqrt{n})\}$, where $q_1$ and $q_2$ are such that 
$P[q_1 < (\bar{X} - \mu)/(S/\sqrt{n}) < q_2] = \gamma$; therefore, $(\bar{X} - q_2(S/\sqrt{n}),\ \bar{X} - q_1(S/\sqrt{n}))$ is 
a $100\gamma$ percent confidence interval for $\mu$. The length of this confidence interval 
is $(q_2 - q_1)(S/\sqrt{n})$, which is random. For any given sample the length will be 
minimized if $q_1$ and $q_2$ are selected so that $q_2 - q_1$ is a minimum. A little 
reflection will convince one that $q_1$ and $q_2$ should be symmetrically selected about 
0, or the following argument can be advanced. We seek to minimize 


$$L = \frac{S}{\sqrt{n}}\,(q_2 - q_1)$$ 

subject to 

$$\int_{q_1}^{q_2} f_T(t)\,dt = \gamma, \tag{3}$$ 

where $f_T(t)$ is the density of the t distribution with $n - 1$ degrees of freedom. 
Equation (3) gives $q_2$ as a function of $q_1$, and differentiating Eq. (3) with respect 
to $q_1$ yields 

$$f_T(q_2)\,\frac{dq_2}{dq_1} - f_T(q_1) = 0.$$ 

To minimize $L$, we set $dL/dq_1 = 0$; that is, 

$$\frac{dL}{dq_1} = \frac{S}{\sqrt{n}}\left(\frac{dq_2}{dq_1} - 1\right) = 0,$$ 

but 

$$\frac{S}{\sqrt{n}}\left(\frac{f_T(q_1)}{f_T(q_2)} - 1\right) = 0$$ 

if and only if $f_T(q_1) = f_T(q_2)$, which implies that $q_1 = q_2$ [in which case 
$\int_{q_1}^{q_2} f_T(t)\,dt \neq \gamma$] or $q_1 = -q_2$. $q_1 = -q_2$ is the desired solution, and such $q_1$ 
and $q_2$ can be readily obtained from a table of the t distribution. 
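
A small sketch of the resulting interval follows. It uses the five rope breaking strengths quoted in Problem 10 at the end of this chapter purely as sample data, and it hard-codes the .975th quantile of the t distribution with 4 degrees of freedom (2.776) as it would be read from a table; the function name `t_interval` is an assumption made for the illustration.

```python
from math import sqrt
from statistics import fmean, stdev

def t_interval(sample, t_quantile):
    """Symmetric interval Xbar -/+ t * S / sqrt(n) for a normal mean."""
    n = len(sample)
    xbar = fmean(sample)
    s = stdev(sample)                 # uses the divisor n - 1
    half = t_quantile * s / sqrt(n)
    return xbar - half, xbar + half

strengths = [660, 460, 540, 580, 550]   # data quoted in Problem 10
T_975_DF4 = 2.776                       # table value: .975th quantile, 4 d.f.
lo, hi = t_interval(strengths, T_975_DF4)
print(round(lo, 1), round(hi, 1))       # 468.3 647.7
```

The caller supplies the quantile, so the same function serves any confidence coefficient for which a table value is at hand.
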


3.2 Confidence Interval for the Variance 

Again there are two cases depending on whether or not $\mu$ is assumed known, and 
again we leave the case $\mu$ known as an exercise. We want a confidence interval 
for $\sigma^2$ when $\mu$ is unknown. We need a pivotal quantity that can be inverted. 
We know that 

$$Q = \frac{\sum (X_i - \bar{X})^2}{\sigma^2} = \frac{(n - 1)S^2}{\sigma^2}$$ 

has a chi-square distribution with $n - 1$ degrees of freedom; hence $Q$ is a pivotal 
quantity. Also, one has 

$$q_1 < \frac{(n - 1)s^2}{\sigma^2} < q_2$$ 

if and only if 

$$\frac{(n - 1)s^2}{q_2} < \sigma^2 < \frac{(n - 1)s^2}{q_1};$$ 

so 

$$\left(\frac{(n - 1)S^2}{q_2},\ \frac{(n - 1)S^2}{q_1}\right)$$ 

is a $100\gamma$ percent confidence interval for $\sigma^2$, where $q_1$ and $q_2$ are given by 
$P[q_1 < Q < q_2] = \gamma$. See Fig. 5. 

FIGURE 5 Chi-square density with $n - 1$ degrees of freedom. 

$q_1$ and $q_2$ are often selected so that $P[Q < q_1] = P[Q > q_2] = (1 - \gamma)/2$. 
Such a confidence interval is sometimes referred to as the equal-tails confidence 
interval for $\sigma^2$. $q_1$ and $q_2$ can be obtained from a table of the chi-square 
distribution. Again, we might be interested in selecting $q_1$ and $q_2$ so as to 
minimize the length, say $L$, of the confidence interval. 

$$L = (n - 1)S^2\left(\frac{1}{q_1} - \frac{1}{q_2}\right).$$ 

Let $f_Q(q)$ be a chi-square density with $n - 1$ degrees of freedom; then 
differentiating 

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma$$ 

with respect to $q_1$ yields 

$$\frac{dq_2}{dq_1}\,f_Q(q_2) - f_Q(q_1) = 0,$$ 

and so 

$$\frac{dL}{dq_1} = (n - 1)S^2\left(-\frac{1}{q_1^2} + \frac{1}{q_2^2}\,\frac{dq_2}{dq_1}\right) = (n - 1)S^2\left(-\frac{1}{q_1^2} + \frac{1}{q_2^2}\,\frac{f_Q(q_1)}{f_Q(q_2)}\right) = 0,$$ 

which implies that $q_1^2 f_Q(q_1) = q_2^2 f_Q(q_2)$. The length of the confidence interval 
will be minimized if $q_1$ and $q_2$ are selected so that 

$$q_1^2 f_Q(q_1) = q_2^2 f_Q(q_2)$$ 

subject to 

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma.$$ 

A solution for $q_1$ and $q_2$ can be obtained by trial and error or numerical 
integration. 
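
The trial-and-error search can be mechanized. The sketch below is one possible way to do it, under the simplifying assumption that n = 5, so that the chi-square distribution has 4 degrees of freedom and its cumulative distribution function has the closed form $1 - e^{-q/2}(1 + q/2)$. The helper names (`pdf`, `cdf`, `bisect`, `quantile`, `shortest_pair`) and the choice of bisection are assumptions of this illustration, not a method prescribed by the text.

```python
from math import exp

def pdf(q):
    """Chi-square density with 4 degrees of freedom."""
    return (q / 4.0) * exp(-q / 2.0)

def cdf(q):
    """Chi-square cumulative distribution function with 4 degrees of freedom."""
    return 1.0 - exp(-q / 2.0) * (1.0 + q / 2.0)

def bisect(func, lo, hi, iters=200):
    """Root of func on [lo, hi], assuming func(lo) < 0 <= func(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def quantile(p):
    return bisect(lambda q: cdf(q) - p, 0.0, 200.0)

def shortest_pair(gamma):
    """Solve q1^2 f(q1) = q2^2 f(q2) subject to F(q2) - F(q1) = gamma."""
    def q2_of(q1):
        return quantile(cdf(q1) + gamma)      # enforces the coverage constraint
    def condition(q1):
        q2 = q2_of(q1)
        return q1 ** 2 * pdf(q1) - q2 ** 2 * pdf(q2)
    q1_max = quantile(1.0 - gamma)            # beyond this, coverage is impossible
    q1 = bisect(condition, 1e-9, q1_max * (1 - 1e-9))
    return q1, q2_of(q1)

gamma = 0.90
q1, q2 = shortest_pair(gamma)
e1, e2 = quantile((1 - gamma) / 2), quantile((1 + gamma) / 2)   # equal tails
shortest_factor = 1 / q1 - 1 / q2
equal_tails_factor = 1 / e1 - 1 / e2
print(q1, q2, shortest_factor, equal_tails_factor)
```

The two printed factors multiply $(n - 1)S^2$ to give the interval lengths, so the comparison does not depend on the observed sample.
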



FIGURE 6 Rectangular region in the $(\mu, \sigma^2)$ plane bounded by $\mu = \bar{x} \pm t_{.975}\sqrt{s^2/n}$, $\sigma^2 = (n - 1)s^2/\chi^2_{.975}$, and $\sigma^2 = (n - 1)s^2/\chi^2_{.025}$. 


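We might note that for any $q_1$ and $q_2$ satisfying 

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma,$$ 

$$\left(\sqrt{\frac{(n - 1)S^2}{q_2}},\ \sqrt{\frac{(n - 1)S^2}{q_1}}\right)$$ 

is a $100\gamma$ percent confidence interval for $\sigma$. 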


3.3 Simultaneous Confidence Region for the Mean and Variance 

In constructing a region for the joint estimation of the mean $\mu$ and variance $\sigma^2$ 
of a normal distribution, one might at first be inclined to use Subsec. 3.1 and 3.2 
above. That is, for example, one might construct a confidence region as in 
Fig. 6 by using the two relations 

$$P\left[\bar{X} - t_{.975}\sqrt{S^2/n} < \mu < \bar{X} + t_{.975}\sqrt{S^2/n}\right] = .95 \tag{4}$$ 

and 

$$P\left[\frac{(n - 1)S^2}{\chi^2_{.975}} < \sigma^2 < \frac{(n - 1)S^2}{\chi^2_{.025}}\right] = .95, \tag{5}$$ 

where $t_{.975}$ is the .975th quantile point of the t distribution with $n - 1$ degrees 
of freedom and $\chi^2_{.025}$ and $\chi^2_{.975}$ are the .025th quantile point and .975th quantile 
point, respectively, of the chi-square distribution with $n - 1$ degrees of freedom. 
The region displayed in Fig. 6 does indeed give a confidence region for $(\mu, \sigma^2)$, 
but we do not know what its corresponding confidence coefficient is. [It is not 
$.95^2$ since the two events given in Eqs. (4) and (5) are not independent.] 

A confidence region, whose corresponding confidence coefficient can be 
readily evaluated, may be set up, however, using the independence of $\bar{X}$ and $S^2$. 
Since 

$$Q_1 = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad \text{and} \qquad Q_2 = \frac{(n - 1)S^2}{\sigma^2}$$ 

are each pivotal quantities, we may find numbers $q_1$, $q_2'$, and $q_2''$ such that 

$$P\left[-q_1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < q_1\right] = \gamma_1 \tag{6}$$ 

and 

$$P\left[q_2' < \frac{(n - 1)S^2}{\sigma^2} < q_2''\right] = \gamma_2. \tag{7}$$ 

Also, since $Q_1$ and $Q_2$ are independent, we have the joint probability 

$$P\left[-q_1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < q_1;\ q_2' < \frac{(n - 1)S^2}{\sigma^2} < q_2''\right] = \gamma_1\gamma_2. \tag{8}$$ 

The four inequalities in Eq. (8) determine a region in the parameter space, which 
is easily found by plotting its boundaries. One merely replaces the inequality 
signs by equality signs and plots each of the four resulting equations as functions 
of $\mu$ and $\sigma^2$. A region such as the shaded area in Fig. 7 will result. 

We might note that a confidence region for $(\mu, \sigma)$ could be obtained in 
exactly the same way; the equations would be plotted as functions of $\sigma$ instead of 
$\sigma^2$, and the parabola in Fig. 7 would become a pair of straight lines given by 
$\mu = \bar{x} \pm q_1\sigma/\sqrt{n}$ intersecting at $\bar{x}$ on the $\mu$ axis. 

The region that we have constructed does not have minimum area for 
fixed $\gamma_1$ and $\gamma_2$. Its advantage is that it is easily constructible from existing 
tables and it will differ but little from the region of minimum area unless the 
sample size is small. The region of minimum area is roughly elliptic in shape 
and difficult to construct. 
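
The product rule $\gamma_1\gamma_2$ for the region built from the two independent pivotal quantities can be checked by simulation. In the sketch below, $\gamma_1 = \gamma_2 = .90$ and n = 5; the chi-square quantiles for 4 degrees of freedom are hard-coded table values, and the true mean 10, the true variance 4, the seed, and the function name `in_region` are assumptions of the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def in_region(sample, mu, sigma2, z, chi_lo, chi_hi):
    """True when (mu, sigma2) satisfies all four inequalities of the region."""
    n = len(sample)
    xbar = fmean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)        # equals (n - 1) S^2
    mean_ok = -z < (xbar - mu) / sqrt(sigma2 / n) < z
    var_ok = chi_lo < ss / sigma2 < chi_hi
    return mean_ok and var_ok

Z_95 = NormalDist().inv_cdf(0.95)                    # gives gamma_1 = .90
CHI_05_DF4, CHI_95_DF4 = 0.7107, 9.4877              # table values, gamma_2 = .90

rng = random.Random(1)
mu, sigma2, n, reps = 10.0, 4.0, 5, 20000
hits = 0
for _ in range(reps):
    sample = [rng.gauss(mu, sqrt(sigma2)) for _ in range(n)]
    hits += in_region(sample, mu, sigma2, Z_95, CHI_05_DF4, CHI_95_DF4)
coverage = hits / reps
print(coverage)   # should be close to .90 * .90 = .81
```
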
3.4 Confidence Interval for Difference in Means 

Let $X_1, \ldots, X_m$ be a random sample of size $m$ from a normal distribution with 
mean $\mu_1$ and variance $\sigma^2$, and let $Y_1, \ldots, Y_n$ be a random sample of size $n$ from 
a normal distribution with mean $\mu_2$ and variance $\sigma^2$. Assume that the two 
samples are independent of each other. We want a confidence interval for 
$\mu_2 - \mu_1$. $\bar{Y} - \bar{X}$ is normally distributed with mean $\mu_2 - \mu_1$ and variance 
$\sigma^2/n + \sigma^2/m$. $\sum (X_i - \bar{X})^2/\sigma^2$ is chi-square-distributed with $m - 1$ degrees of 
freedom, and $\sum (Y_i - \bar{Y})^2/\sigma^2$ is chi-square-distributed with $n - 1$ degrees of 
freedom; hence 

$$\frac{\sum (X_i - \bar{X})^2}{\sigma^2} + \frac{\sum (Y_i - \bar{Y})^2}{\sigma^2}$$ 

is chi-square-distributed with $m + n - 2$ degrees of freedom. Finally, 

$$Q = \frac{[(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)]/\sqrt{\sigma^2/m + \sigma^2/n}}{\sqrt{[\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2]/\sigma^2(m + n - 2)}}$$ 

$$= \frac{(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)}{\sqrt{(1/m + 1/n)[\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2]/(m + n - 2)}} = \frac{(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)}{\sqrt{(1/m + 1/n)S_p^2}}$$ 

has a t distribution with $m + n - 2$ degrees of freedom. Thus it follows that 
$\gamma = P[-t_{(1+\gamma)/2} < Q < t_{(1+\gamma)/2}]$, where $t_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile point 
of the t distribution with $m + n - 2$ degrees of freedom. $S_p^2$ is an unbiased 
estimator of the common variance $\sigma^2$. (The subscript $p$ can be thought of as an 
abbreviation for “pooled”; $S_p^2$ is a pooled estimator of $\sigma^2$, the two samples 
being pooled together.) Now 

$$-t_{(1+\gamma)/2} < \frac{(\bar{y} - \bar{x}) - (\mu_2 - \mu_1)}{\sqrt{(1/m + 1/n)s_p^2}} < t_{(1+\gamma)/2}$$ 

if and only if 

$$(\bar{y} - \bar{x}) - t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)s_p^2} < \mu_2 - \mu_1 < (\bar{y} - \bar{x}) + t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)s_p^2};$$ 

hence 

$$\left((\bar{Y} - \bar{X}) - t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2},\ (\bar{Y} - \bar{X}) + t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2}\right) \tag{9}$$ 

is a $100\gamma$ percent confidence interval for $\mu_2 - \mu_1$. 
We assumed above that we had two independent random samples. Now 
assume that $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a random sample from the bivariate normal 
distribution with parameters given by $\mu_1 = \mathcal{E}[X]$, $\mu_2 = \mathcal{E}[Y]$, $\sigma_1^2 = \text{var}[X]$, 
$\sigma_2^2 = \text{var}[Y]$, and $\rho = \text{cov}[X, Y]/\sigma_1\sigma_2$. The object is to find a 
confidence-interval estimate of $\mu_2 - \mu_1$. Let $D_i = Y_i - X_i$, $i = 1, \ldots, n$; then $D_1, \ldots, D_n$ 
are independent and identically distributed random variables with common 
normal distribution having mean $\mu_D = \mu_2 - \mu_1$ and variance 
$\sigma_D^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$. Pretending that $D_1, \ldots, D_n$ is our random sample and proceeding as 
in Subsec. 3.1, we obtain the following $100\gamma$ percent confidence interval for 
$\mu_2 - \mu_1$: 

$$\left(\bar{D} - t_{(1+\gamma)/2}\sqrt{\frac{\sum (D_i - \bar{D})^2}{n(n - 1)}},\ \bar{D} + t_{(1+\gamma)/2}\sqrt{\frac{\sum (D_i - \bar{D})^2}{n(n - 1)}}\right) \tag{10}$$ 

where $t_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile point of the t distribution with $n - 1$ 
degrees of freedom. The above obtained confidence interval for $\mu_2 - \mu_1$ is 
often referred to as the confidence interval for the difference in means for paired 
observations. The $i$th $X$ observation is paired with the $i$th $Y$ observation. 
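
The two intervals of this subsection, Eq. (9) for independent samples and Eq. (10) for paired observations, can be sketched side by side. The data below are hypothetical and serve only to exercise the formulas; the t quantiles (2.447 for 6 degrees of freedom, 3.182 for 3 degrees of freedom) are table values for a 95 percent interval, and the function names are assumptions of this illustration.

```python
from math import sqrt
from statistics import fmean

def pooled_interval(x, y, t_quantile):
    """Interval of Eq. (9) for mu_2 - mu_1 from two independent samples."""
    m, n = len(x), len(y)
    xbar, ybar = fmean(x), fmean(y)
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    sp2 = ss / (m + n - 2)                      # pooled variance estimator
    half = t_quantile * sqrt((1 / m + 1 / n) * sp2)
    return (ybar - xbar) - half, (ybar - xbar) + half

def paired_interval(x, y, t_quantile):
    """Interval of Eq. (10) for mu_2 - mu_1 from paired observations."""
    d = [yi - xi for xi, yi in zip(x, y)]
    n = len(d)
    dbar = fmean(d)
    half = t_quantile * sqrt(sum((di - dbar) ** 2 for di in d) / (n * (n - 1)))
    return dbar - half, dbar + half

# Hypothetical data, not taken from the text.
x = [10.0, 12.0, 11.0, 13.0]
y = [12.0, 15.0, 13.0, 16.0]
pooled = pooled_interval(x, y, t_quantile=2.447)   # m + n - 2 = 6 d.f.
paired = paired_interval(x, y, t_quantile=3.182)   # n - 1 = 3 d.f.
print(pooled)
print(paired)
```

Which interval is appropriate depends on the sampling scheme, not on the numbers: Eq. (9) presumes two independent samples with a common variance, whereas Eq. (10) presumes that the $i$th $X$ and $i$th $Y$ form a pair.
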


4 METHODS OF FINDING CONFIDENCE INTERVALS 

In this section we will discuss two methods of obtaining a confidence interval. 
A third method will be presented in Chap. IX. 


4.1 Pivotal-quantity Method 

We described the pivotal-quantity method of finding confidence intervals in 
Subsec. 2.3, but we left unanswered the question of whether or not a pivotal 
quantity would actually exist for a given problem. The following remark gives 
a partial answer to this question. 


Remark If $X_1, \ldots, X_n$ is a random sample from $f(\cdot\,; \theta)$, for which the 
corresponding cumulative distribution function $F(x; \theta)$ is continuous 
in $x$, then, by the probability integral transform, $F(X_i; \theta)$ has a uniform 
distribution over the interval (0, 1). Hence $-\log F(X_i; \theta)$ has the 
density $e^{-u}I_{(0, \infty)}(u)$ since $P[-\log F(X_i; \theta) > u] = P[\log F(X_i; \theta) < -u] = 
P[F(X_i; \theta) < e^{-u}] = e^{-u}$ for $u > 0$. Finally $-\sum \log F(X_i; \theta)$ has a gamma 
distribution with parameters $n$ and 1; that is, 

$$P\left[-\log q_2 < -\sum_{i=1}^{n} \log F(X_i; \theta) < -\log q_1\right] = \int_{-\log q_2}^{-\log q_1} \frac{1}{\Gamma(n)}\,z^{n-1}e^{-z}\,dz$$ 

$$= P\left[q_1 < \prod_{i=1}^{n} F(X_i; \theta) < q_2\right] \qquad \text{for } 0 < q_1 < q_2 < 1. \tag{11}$$ 

So 

$$\prod_{i=1}^{n} F(X_i; \theta), \qquad \text{or} \qquad -\sum_{i=1}^{n} \log F(X_i; \theta),$$ 

is a pivotal quantity. //// 


The remark shows that any time that we sample from a population having 
a continuous cumulative distribution function, a pivotal quantity exists. Note 
that as of yet we have no assurance that the pivotal quantity exhibited by the 
remark can be utilized to find a confidence interval. If, however, $F(x; \theta)$ is 
monotone in $\theta$ for each $x$, then $\prod_{i=1}^{n} F(x_i; \theta)$ is also monotone in $\theta$ for each 
$x_1, \ldots, x_n$, and such monotonicity allows one to find a confidence interval for $\theta$. 
We see from Fig. 8 that $q_1 < \prod_{i=1}^{n} F(x_i; \theta) < q_2$ if and only if 
$t_1(x_1, \ldots, x_n) < \theta < t_2(x_1, \ldots, x_n)$, where $t_1$ and $t_2$ are defined as indicated. 


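EXAMPLE 4 Let $X_1, \ldots, X_n$ be a random sample from the density 
$f(x; \theta) = \theta x^{\theta - 1}I_{(0, 1)}(x)$; then $F(x; \theta) = x^{\theta}I_{(0, 1)}(x) + I_{[1, \infty)}(x)$. If $q_1$ and $q_2$ are 
selected [see Eq. (11)] so that 

$$\gamma = P\left[q_1 < \prod_{i=1}^{n} F(X_i; \theta) < q_2\right] = P\left[q_1 < \prod_{i=1}^{n} X_i^{\theta} < q_2\right]$$ 

$$= P\left[\log q_1 < \theta \log \prod_{i=1}^{n} X_i < \log q_2\right] = P\left[-\log q_2 < -\theta \log \prod_{i=1}^{n} X_i < -\log q_1\right]$$ 

$$= P\left[\frac{\log q_2}{\log \prod_{i=1}^{n} X_i} < \theta < \frac{\log q_1}{\log \prod_{i=1}^{n} X_i}\right],$$ 

then 

$$\left(\frac{\log q_2}{\log \prod_{i=1}^{n} X_i},\ \frac{\log q_1}{\log \prod_{i=1}^{n} X_i}\right)$$ 

is a $100\gamma$ percent confidence interval for $\theta$. //// 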
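
A sketch of Example 4 in code follows. It works with $a = -\log q_2$ and $b = -\log q_1$, the limits of the gamma-distributed pivotal quantity of Eq. (11), and uses the closed form of the gamma cumulative distribution function for integer shape $n$. The equal-tails choice of $a$ and $b$, the true value $\theta = 2.5$, the sample size, the seed, and all function names are assumptions made for the illustration.

```python
import random
from math import exp, log

def erlang_cdf(z, n):
    """cdf of the gamma distribution with integer shape n and scale 1."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= z / k
        total += term
    return 1.0 - exp(-z) * total

def erlang_quantile(p, n, hi=1000.0):
    """Invert erlang_cdf by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if erlang_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def pivot_limits(n, gamma):
    """Equal-tails limits a = -log q2 and b = -log q1 for the gamma pivot."""
    return erlang_quantile((1 - gamma) / 2, n), erlang_quantile((1 + gamma) / 2, n)

def theta_interval(sample, a, b):
    """(log q2 / log prod x_i, log q1 / log prod x_i), written with a and b."""
    neg_log_prod = -sum(log(x) for x in sample)
    return a / neg_log_prod, b / neg_log_prod

rng = random.Random(2)
theta, n, gamma, reps = 2.5, 10, 0.90, 10000
a, b = pivot_limits(n, gamma)
hits = 0
for _ in range(reps):
    # If U is uniform on (0, 1], then U**(1/theta) has cdf x**theta on (0, 1].
    sample = [(1.0 - rng.random()) ** (1.0 / theta) for _ in range(n)]
    lo, hi = theta_interval(sample, a, b)
    hits += lo < theta < hi
coverage = hits / reps
print(coverage)   # should be close to gamma = .90
```
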


We conclude this subsection with two further comments regarding the 
existence of pivotal quantities. First, if $\theta$ is a location parameter, then $X_i - \theta$ 
has a distribution independent of $\theta$ by definition and, hence, is a pivotal quantity 
as are a variety of other random quantities, including $\sum X_i - n\theta$, $Y_j - \theta$, 
$Y_1 + Y_n - 2\theta$, etc. Second, if $\theta$ is a scale parameter, then by definition $X_i/\theta$ is 
distributed independently of $\theta$ and, hence, is a pivotal quantity as are $\bar{X}/\theta$, 
$Y_j/\theta$, etc. 


4.2 Statistical Method 

As usual, we assume that we have a random sample $X_1, \ldots, X_n$ from the density 
$f(\cdot\,; \theta_0)$. We further assume that the parameter $\theta_0$ is real and that the 
parameter space $\Theta$ is some interval. (In this subsection, we will let $\theta_0$ denote the 
true parameter value.) We seek an interval estimate of $\theta_0$ itself. Let 
$T = t(X_1, \ldots, X_n)$ be some statistic. The statistic $T$ can be selected in several ways. 
For instance, if a sufficient statistic (unidimensional) exists, then $T$ could be 
taken to be a sufficient statistic; or if no sufficient statistic exists, $T$ could be taken 
to be a point estimator, possibly the maximum-likelihood estimator, of $\theta_0$. 
The actual choice of $T$ might depend on the ease with which the operations that 
need to be performed to obtain the confidence interval can be performed. One 
of those operations will be the determination of the density of $T$. 
Let $f_T(t; \theta)$ denote the density of $T$. We will proceed as though $T$ is a 
continuous random variable, although the technique will also work for $T$ as a 
discrete random variable. We can define two functions, say $h_1(\theta)$ and $h_2(\theta)$, as 
follows: 

$$\int_{-\infty}^{h_1(\theta)} f_T(t; \theta)\,dt = p_1 \qquad \text{and} \qquad \int_{h_2(\theta)}^{\infty} f_T(t; \theta)\,dt = p_2, \tag{12}$$ 

where $p_1$ and $p_2$ are two fixed numbers satisfying $0 < p_1$, $0 < p_2$, and $p_1 + p_2 < 1$. 
See Fig. 9. 

$h_1(\theta)$ and $h_2(\theta)$ can be plotted as functions of $\theta$. We will assume that both 
$h_1(\cdot)$ and $h_2(\cdot)$ are strictly monotone, and for our sketch we will assume that 
they are monotone, increasing functions. We know that $h_1(\theta) < h_2(\theta)$. See 
Fig. 10. 

Let $t_0$ denote an observed value of $T$; that is, $t_0 = t(x_1, \ldots, x_n)$ for an 
observed random sample $x_1, \ldots, x_n$. Plot the value of $t_0$ on the vertical axis in 
Fig. 10, and then find $v_1$ and $v_2$ as indicated. For any possible value of $t_0$, a 
corresponding $v_1$ and $v_2$ can be obtained, so $v_1$ and $v_2$ are functions of $t_0$; denote 
these by $v_1 = v_1(t_0)$ and $v_2 = v_2(t_0)$. The interval $(V_1, V_2)$ will turn out to be a 
$100(1 - p_1 - p_2)$ percent confidence interval for $\theta_0$. To argue that this is so, 
let us repeat Fig. 10 as Fig. 11 and add to it. (Figure 10 indicates the method of 
finding the confidence interval.) 
We see from Fig. 11 that $h_1(\theta_0) < t_0 = t(x_1, \ldots, x_n) < h_2(\theta_0)$ if and only if 
$v_1 = v_1(x_1, \ldots, x_n) < \theta_0 < v_2 = v_2(x_1, \ldots, x_n)$ for any possible observed sample 
$(x_1, \ldots, x_n)$. But by definition of $h_1(\cdot)$ and $h_2(\cdot)$, 

$$P_{\theta_0}[h_1(\theta_0) < t(X_1, \ldots, X_n) < h_2(\theta_0)] = 1 - p_1 - p_2;$$ 

so 

$$P_{\theta_0}[v_1(X_1, \ldots, X_n) < \theta_0 < v_2(X_1, \ldots, X_n)] = 1 - p_1 - p_2;$$ 

that is, as stated, $(V_1, V_2)$ is a $100(1 - p_1 - p_2)$ percent confidence interval for 
$\theta_0$, where $V_i = v_i(X_1, \ldots, X_n)$ for $i = 1, 2$. 

We might note that the above procedure would work even if $h_1(\cdot)$ and 
$h_2(\cdot)$ were not monotone functions, only then we would obtain a confidence 
region (often in the form of a set of intervals) instead of a confidence interval. 


EXAMPLE 5 Let $X_1, \ldots, X_n$ be a random sample from the density 
$f(x; \theta_0) = (1/\theta_0)I_{(0, \theta_0)}(x)$. We want a confidence interval for $\theta_0$. 
$Y_n = \max[X_1, \ldots, X_n]$ is known to be a sufficient statistic; it is also the 
maximum-likelihood estimator of $\theta_0$. We will use $Y_n$ as our statistic $T$ that appears 
in the above discussion; then 

$$f_T(t; \theta) = n\left(\frac{t}{\theta}\right)^{n-1}\frac{1}{\theta}\,I_{(0, \theta)}(t).$$ 

For given $p_1$ and $p_2$, find $h_1(\theta)$ and $h_2(\theta)$. $p_1 = \int_0^{h_1(\theta)} nt^{n-1}\theta^{-n}\,dt$ implies 
that $\int_0^{h_1(\theta)} t^{n-1}\,dt = \theta^n p_1/n$, which in turn implies $[h_1(\theta)]^n/n = \theta^n p_1/n$, or 
finally $h_1(\theta) = \theta p_1^{1/n}$. Similarly, $p_2 = \int_{h_2(\theta)}^{\theta} nt^{n-1}\theta^{-n}\,dt$ implies that 
$\theta^n - [h_2(\theta)]^n = \theta^n p_2$ or $h_2(\theta) = \theta(1 - p_2)^{1/n}$. See Fig. 12, which is Fig. 10 
for the example at hand. 

For observed $t_0 = \max[x_1, \ldots, x_n]$, $v_1$ is such that $h_2(v_1) = t_0$; 
that is, $h_2(v_1) = v_1(1 - p_2)^{1/n} = t_0$ or $v_1 = t_0(1 - p_2)^{-1/n}$. Similarly, 
$v_2 = t_0 p_1^{-1/n}$. So a $100(1 - p_1 - p_2)$ percent confidence interval for $\theta_0$ is 
given by $(Y_n(1 - p_2)^{-1/n},\ Y_n p_1^{-1/n})$. We could worry about selecting $p_1$ 
and $p_2$ so that the confidence interval is shortest subject to the restriction 
that $1 - p_1 - p_2 = \gamma$. The length of the confidence interval is 

$$L = Y_n[p_1^{-1/n} - (1 - p_2)^{-1/n}],$$ 

and so the length will be shortest if $p_1$ and $p_2$ are picked so as to minimize 
$p_1^{-1/n} - (1 - p_2)^{-1/n}$ subject to $1 - p_1 - p_2 = \gamma$ and $0 < p_1 + p_2 < 1$, 
which is accomplished by picking $p_2 = 0$ and $p_1 = 1 - \gamma$. 

We might note that $Y_n/\theta$ is a pivotal quantity and a confidence 
interval for $\theta$ can be obtained more easily using the pivotal-quantity 
method. //// 
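
With $p_2 = 0$ and $p_1 = 1 - \gamma$ the interval of Example 5 becomes $(Y_n,\ Y_n(1 - \gamma)^{-1/n})$. The sketch below computes it and checks its coverage by simulation; the true value $\theta_0 = 7$, the sample size, the seed, and the function name `uniform_interval` are assumptions of the illustration.

```python
import random

def uniform_interval(sample, gamma):
    """Shortest interval of Example 5: (Y_n, Y_n * (1 - gamma)**(-1/n))."""
    n = len(sample)
    y_n = max(sample)
    return y_n, y_n * (1.0 - gamma) ** (-1.0 / n)

rng = random.Random(3)
theta0, n, gamma, reps = 7.0, 8, 0.90, 20000
hits = 0
for _ in range(reps):
    sample = [rng.uniform(0.0, theta0) for _ in range(n)]
    lo, hi = uniform_interval(sample, gamma)
    hits += lo < theta0 < hi
coverage = hits / reps
print(coverage)   # should be close to gamma = .90
```

The lower endpoint is the sample maximum itself, which can never exceed $\theta_0$; all of the error probability sits at the upper endpoint.
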

We observe in the example above, and in general for that matter, that 
$h_1(\theta)$ and $h_2(\theta)$ are really not needed. For a given observed value 
$t_0 = t(x_1, \ldots, x_n)$ of the statistic $T$, we need to find $v_1 = v_1(x_1, \ldots, x_n)$ and 
$v_2 = v_2(x_1, \ldots, x_n)$. $v_2$ can be found by solving for $\theta$ in the equation 

$$p_1 = \int_{-\infty}^{h_1(\theta) = t_0} f_T(t; \theta)\,dt; \tag{13}$$ 

$v_2$ is the solution. $v_1$ can be found by solving for $\theta$ in the equation 

$$p_2 = \int_{h_2(\theta) = t_0}^{\infty} f_T(t; \theta)\,dt; \tag{14}$$ 

$v_1$ is the solution. 

We mentioned at the outset that the method would work for discrete 
random variables as well as for continuous random variables. Then the 
integrals in Eqs. (12) to (14) would need to be replaced by summations. Two 
popular discrete density functions are the Bernoulli and Poisson. One could be 
interested in confidence-interval estimates of the parameters in each. In 
Example 6 to follow we will consider the Bernoulli density function; the Poisson 
case is left as an exercise. 


EXAMPLE 6 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli density; 
that is, $P[X = 1] = \theta_0 = 1 - P[X = 0]$. We know that $T = \sum X_i$ is a 
sufficient statistic; furthermore, $T$ has a binomial distribution; that is, 
$P[T = t] = \binom{n}{t}\theta_0^t(1 - \theta_0)^{n-t}$ for $t = 0, 1, \ldots, n$. We want a 
confidence-interval estimate for $\theta_0$. Suppose we observe $T = t_0$ (necessarily an 
integer). According to Eqs. (13) and (14) we need to solve for $\theta$ in each of 
the equations 

$$p_1 = \sum_{t=0}^{t_0} \binom{n}{t}\theta^t(1 - \theta)^{n-t}$$ 

and 

$$p_2 = \sum_{t=t_0}^{n} \binom{n}{t}\theta^t(1 - \theta)^{n-t}.$$ 

To actually solve these equations, a table of the binomial distribution is 
useful. If $p_1 = .0509$, $p_2 = .0159$, $n = 20$, and if $T = 4$ is observed, then 
the 93.32 percent confidence interval (.05, .40) for $\theta_0$ is obtained. //// 
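
In place of a binomial table, the two equations of Example 6 can be solved numerically. The sketch below uses bisection, which is an assumption of this illustration rather than the method of the text, and it requires $0 < t_0 < n$; the function names are likewise hypothetical. It reproduces the interval quoted in the example.

```python
from math import comb

def binom_cdf(t, n, theta):
    """P[T <= t] for a binomial random variable with parameters n and theta."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               for k in range(t + 1))

def bisect_theta(func, iters=100):
    """Root in (0, 1) of an increasing function with func(0) < 0 < func(1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def bernoulli_interval(t0, n, p1, p2):
    """Solve Eqs. (13) and (14) with sums in place of integrals (0 < t0 < n)."""
    # Upper limit v2: p1 = P[T <= t0]; the left side minus the right increases in theta.
    upper = bisect_theta(lambda th: p1 - binom_cdf(t0, n, th))
    # Lower limit v1: p2 = P[T >= t0] = 1 - P[T <= t0 - 1], increasing in theta.
    lower = bisect_theta(lambda th: (1 - binom_cdf(t0 - 1, n, th)) - p2)
    return lower, upper

lower, upper = bernoulli_interval(t0=4, n=20, p1=0.0509, p2=0.0159)
print(round(lower, 2), round(upper, 2))   # 0.05 0.4
```
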


5 LARGE-SAMPLE CONFIDENCE INTERVALS 

We have seen in our studies of point estimation that it is sometimes possible to 
find a sequence of estimators, say $T_n = t_n(X_1, \ldots, X_n)$, of $\theta$ in a density $f(\cdot\,; \theta)$ 
that are asymptotically normally distributed about $\theta$; that is, $T_n$ is approximately 
normally distributed with mean $\theta$ and variance, say, $\sigma_n^2(\theta)$, where $\sigma_n^2(\theta)$ indicates 
that the variance is a function of $\theta$ (since it will ordinarily depend on $\theta$) and the 
sample size $n$. In particular, we have seen in Sec. 9 of Chap. VII that for large 
samples the maximum-likelihood estimator, say $\hat{\Theta}_n = \hat{\theta}_n(X_1, \ldots, X_n)$, for a 
parameter $\theta$ in a density $f(\cdot\,; \theta)$ is approximately normally distributed about $\theta$ 
under rather general conditions. The large-sample variance of the 
maximum-likelihood estimator was seen to be, say, 

$$\sigma_n^2(\theta) = \frac{1}{n\mathcal{E}_\theta[\{(\partial/\partial\theta) \log f(X; \theta)\}^2]} = \frac{-1}{n\mathcal{E}_\theta[(\partial^2/\partial\theta^2) \log f(X; \theta)]}. \tag{15}$$ 
When such a sequence of asymptotically normally distributed estimators $\{T_n\}$ of 
$\theta$ exists, it is sometimes possible to obtain approximate confidence intervals 
quite easily. $(T_n - \theta)/\sigma_n(\theta)$ can be treated as an approximate pivotal quantity, 
and, therefore, for large sample size a confidence interval with an approximate 
confidence coefficient $\gamma$ may be determined by converting the inequalities 

$$-z < \frac{T_n - \theta}{\sigma_n(\theta)} < z, \tag{16}$$ 

where $z = z_{(1+\gamma)/2}$ is defined by $\Phi(z_{(1+\gamma)/2}) = (1 + \gamma)/2$ or $\Phi(z) - \Phi(-z) = \gamma$. 
The above described method will always work to find a large-sample confidence 
interval provided the inequality $-z < (T_n - \theta)/\sigma_n(\theta) < z$ can be inverted. 


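EXAMPLE 7 Let $X_1, \ldots, X_n$ be a random sample from the density 
$f(x; \theta) = \theta e^{-\theta x}I_{(0, \infty)}(x)$. We know (see Example 51 in Chap. VII) that the 
maximum-likelihood estimator of $\theta$, which is $1/\bar{X}_n$, has an asymptotic normal 
distribution with mean $\theta$ and variance equal to 

$$\sigma_n^2(\theta) = \frac{1}{n\mathcal{E}_\theta[\{(\partial/\partial\theta) \log f(X; \theta)\}^2]} = \frac{\theta^2}{n}.$$ 

Therefore, 

$$\gamma \approx P\left[-z < \frac{1/\bar{X}_n - \theta}{\theta/\sqrt{n}} < z\right] = P\left[\frac{1/\bar{X}_n}{1 + z/\sqrt{n}} < \theta < \frac{1/\bar{X}_n}{1 - z/\sqrt{n}}\right],$$ 

and hence 

$$\left(\frac{1/\bar{X}_n}{1 + z/\sqrt{n}},\ \frac{1/\bar{X}_n}{1 - z/\sqrt{n}}\right)$$ 

is a large-sample confidence interval for $\theta$ with an approximate confidence 
coefficient $\gamma$, where $z$ is given by $\Phi(z) - \Phi(-z) = \gamma$. //// 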
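
The interval of Example 7 is simple to compute once the sample mean is known. In the sketch below the sample is simulated from an exponential density with $\theta = 2$; that value, the sample size of 100, the seed, and the function name are assumptions of the illustration. The guard on $z < \sqrt{n}$ reflects the fact that the inequality can be inverted in this form only when $1 - z/\sqrt{n}$ is positive.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def exponential_rate_interval(sample, gamma):
    """Large-sample interval ((1/xbar)/(1 + z/sqrt(n)), (1/xbar)/(1 - z/sqrt(n)))."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    if z >= sqrt(n):
        raise ValueError("sample too small for the inequality to be inverted")
    mle = 1.0 / fmean(sample)            # maximum-likelihood estimate of theta
    return mle / (1 + z / sqrt(n)), mle / (1 - z / sqrt(n))

rng = random.Random(4)
sample = [rng.expovariate(2.0) for _ in range(100)]   # simulated, theta = 2
lo, hi = exponential_rate_interval(sample, gamma=0.95)
print(lo, hi)
```
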


EXAMPLE 8 Consider sampling from the Bernoulli distribution with 
parameter $\theta = P[X = 1] = 1 - P[X = 0]$. The maximum-likelihood estimator 
of $\theta$ is $\hat{\Theta} = \bar{X}_n$, and it has variance $\sigma_n^2(\theta) = \theta(1 - \theta)/n$. An approximate 
$100\gamma$ percent confidence interval for $\theta$ is obtained by converting the 
inequalities in 

$$P\left[-z < \frac{\hat{\Theta} - \theta}{\sqrt{\theta(1 - \theta)/n}} < z\right] \approx \gamma$$ 

to get 

$$P\left[\frac{2n\hat{\Theta} + z^2 - z\sqrt{4n\hat{\Theta} + z^2 - 4n\hat{\Theta}^2}}{2(n + z^2)} < \theta < \frac{2n\hat{\Theta} + z^2 + z\sqrt{4n\hat{\Theta} + z^2 - 4n\hat{\Theta}^2}}{2(n + z^2)}\right] \approx \gamma. \tag{17}$$ 

These expressions for the limits may be simplified since in deriving the 
large-sample distribution certain terms containing the factor $1/\sqrt{n}$ are 
neglected; that is, the asymptotic normal distribution is correct only to 
within error terms of size a constant times $1/\sqrt{n}$. We may therefore 
neglect terms of this order in the limits in Eq. (17) without appreciably 
affecting the accuracy of the approximation. This means simply that we 
may omit all the $z^2$ terms in Eq. (17) because they always occur added to a 
term with factor $n$ and will be negligible relative to $n$ when $n$ is large to 
within the degree of approximation that we are assuming. Thus Eq. (17) 
may be rewritten as 

$$P\left[\hat{\Theta} - z\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}} < \theta < \hat{\Theta} + z\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}}\right] \approx \gamma. \tag{18}$$ 

In particular, 

$$P\left[\hat{\Theta} - 1.96\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}} < \theta < \hat{\Theta} + 1.96\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}}\right] \approx .95$$ 

gives an approximate 95 percent confidence interval for $\theta$ for large 
samples. //// 
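
The limits in Eqs. (17) and (18) can be compared numerically. The sketch below evaluates both for $n = 20$ and an observed proportion of .2; these numbers are chosen only for illustration (a sample this small is not really "large," which is why the two sets of limits differ visibly), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def interval_eq17(theta_hat, n, z):
    """Limits obtained by solving the quadratic inequality, as in Eq. (17)."""
    centre = 2 * n * theta_hat + z ** 2
    root = z * sqrt(4 * n * theta_hat + z ** 2 - 4 * n * theta_hat ** 2)
    denom = 2 * (n + z ** 2)
    return (centre - root) / denom, (centre + root) / denom

def interval_eq18(theta_hat, n, z):
    """Simplified limits of Eq. (18), with the z**2 terms dropped."""
    half = z * sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

z = NormalDist().inv_cdf(0.975)          # about 1.96 for gamma = .95
lo17, hi17 = interval_eq17(theta_hat=0.2, n=20, z=z)
lo18, hi18 = interval_eq18(theta_hat=0.2, n=20, z=z)
print(round(lo17, 3), round(hi17, 3))    # 0.081 0.416
print(round(lo18, 3), round(hi18, 3))    # 0.025 0.375
```

As $n$ grows with the observed proportion held fixed, the $z^2$ terms become negligible relative to $n$ and the two pairs of limits approach each other.
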


We may observe that Eq. (18) is just the expression that would have been 
obtained had $\hat{\Theta}$ been substituted for $\theta$ in $\sigma_n(\theta)$. The substitution would imply 
that 

$$\frac{\hat{\Theta} - \theta}{\sqrt{\hat{\Theta}(1 - \hat{\Theta})/n}}$$ 

is approximately normally distributed with mean 0 and unit variance. It is, 
in fact, true in general that in the asymptotic normal distribution of a 
maximum-likelihood estimator $\hat{\Theta}$ the variance $\sigma_n^2(\theta)$ may be replaced by its estimator $\sigma_n^2(\hat{\Theta})$ 
without appreciably affecting the accuracy of the approximation. We shall not 
prove this fact but shall use it because it greatly simplifies the conversion of the 
inequalities that is required to get the large-sample confidence intervals. For 
instance, 

$$P\left[-z < \frac{\hat{\Theta} - \theta}{\sigma_n(\hat{\Theta})} < z\right] \approx \gamma$$ 

readily converts to give 

$$P[\hat{\Theta} - z\sigma_n(\hat{\Theta}) < \theta < \hat{\Theta} + z\sigma_n(\hat{\Theta})] \approx \gamma, \tag{19}$$ 

where $\hat{\Theta}$ is the asymptotically normally distributed maximum-likelihood 
estimator of $\theta$, $\sigma_n(\hat{\Theta})$ in this expression is the maximum-likelihood estimator of 
$\sigma_n(\theta)$ (which is the large-sample standard deviation of $\hat{\Theta}$), and $z$ is given by 
$\Phi(z) - \Phi(-z) = \gamma$. 

We noted in Sec. 9 of Chap. VII that under regularity conditions the joint 
distribution of the maximum-likelihood estimators of the components of a 
$k$-dimensional parameter is asymptotically normally distributed. Although we 
will not so argue, such a result could be used to obtain a large-sample confidence 
region. 

The large-sample confidence intervals presented in this section have an 
optimum property which we shall point out but not prove. Recall that in the 
earlier sections, particularly Sec. 3, of this chapter we were concerned with finding 
the shortest interval for a given probability. Loosely speaking, an analogous 
optimum property of large-sample confidence intervals based on 
maximum-likelihood estimators is this: Large-sample confidence intervals based on the 
maximum-likelihood estimator will be shorter, on the average, than intervals 
determined by any other estimator. 


6 BAYESIAN INTERVAL ESTIMATES 

In Sec. 7 of Chap. VII we examined what is called Bayes estimation. There 
we assumed that a random sample, say $X_1, \ldots, X_n$, from some density 
$f(\cdot\,; \theta) = f(\cdot \mid \theta)$ was available, where the form of the function $f(\cdot \mid \cdot)$ was known and the 
fixed value of $\theta$ was unknown. We further assumed that the unknown fixed 
value of $\theta$ was the value of a random variable $\Theta$ with known density, denoted by 
$g_\Theta(\cdot)$ and called the prior density of $\Theta$. We then used this additional knowledge 
of a known prior density to define the posterior distribution of $\Theta$, and from this 
posterior distribution we defined the posterior Bayes point estimator of $\theta$. In 
this section we use this same posterior distribution of $\Theta$ to arrive at an interval 
estimator of $\theta$. 

If $f(\cdot \mid \theta)$ is the density sampled from and $g_\Theta(\cdot)$ is the prior density of $\Theta$, 
then the posterior density of $\Theta$ given $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$ is [recall Eq. (19) 
of Chap. VII] 

$$f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n) = \frac{\left[\prod_{i=1}^{n} f(x_i \mid \theta)\right]g_\Theta(\theta)}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right]g_\Theta(\theta)\,d\theta}. \tag{20}$$ 

For fixed $\gamma$, any interval, say $(t_1, t_2)$, satisfying 

$$\int_{t_1}^{t_2} f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\,d\theta = \gamma \tag{21}$$ 

is defined to be a $100\gamma$ percent Bayesian interval estimate of $\theta$. In practice, one 
would naturally pick those $t_1$ and $t_2$ satisfying Eq. (21) for which $t_2 - t_1$ is smallest. 
Note that $t_i = t_i(x_1, \ldots, x_n)$; that is, $t_i$ is some function of the observations 
$x_1, \ldots, x_n$. 


EXAMPLE 9 Let $X_1, \ldots, X_n$ be a random sample from the normal density 
with mean $\theta$ and variance 1. Assume that $\Theta$ has a normal density with 
mean $x_0$ and variance 1. Consider estimating $\theta$. We saw in Example 44 
of Chap. VII that the posterior distribution of $\Theta$ is normal with mean 
$\sum_{i=0}^{n} x_i/(n + 1)$ and variance $1/(n + 1)$. We seek $t_1$ and $t_2$ satisfying 

$$\gamma = \int_{t_1}^{t_2} f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\,d\theta = \Phi\left(\frac{t_2 - \sum_{i=0}^{n} x_i/(n + 1)}{\sqrt{1/(n + 1)}}\right) - \Phi\left(\frac{t_1 - \sum_{i=0}^{n} x_i/(n + 1)}{\sqrt{1/(n + 1)}}\right). \tag{22}$$ 

If $z$ is such that $\Phi(z) - \Phi(-z) = \gamma$, then 

$$t_2 = \frac{\sum_{i=0}^{n} x_i}{n + 1} + z\sqrt{\frac{1}{n + 1}} \qquad \text{and} \qquad t_1 = \frac{\sum_{i=0}^{n} x_i}{n + 1} - z\sqrt{\frac{1}{n + 1}}$$ 

gives the shortest $100\gamma$ percent Bayesian interval estimate of $\theta$. Note that 
the corresponding $100\gamma$ percent confidence-interval estimate of $\theta$ is given 
by $\left(\sum_{i=1}^{n} x_i/n - z\sqrt{1/n},\ \sum_{i=1}^{n} x_i/n + z\sqrt{1/n}\right)$. The only difference in the results 
of the two methods for this example is that the sample size seems to 
increase by 1 and the apparent “additional observation” is the mean of 
the assumed prior normal distribution. //// 
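
The comparison at the end of Example 9 can be made concrete. In the sketch below the three observations and the prior mean 0 are hypothetical values chosen for the illustration, as are the function names; the formulas are those of the example, with the prior mean entering as the "additional observation" $x_0$.

```python
from math import sqrt
from statistics import NormalDist

def bayes_interval(sample, prior_mean, gamma):
    """Shortest Bayesian interval: posterior mean -/+ z * sqrt(1/(n + 1))."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    post_mean = (prior_mean + sum(sample)) / (n + 1)   # x_0 counts as one more x_i
    half = z * sqrt(1 / (n + 1))
    return post_mean - half, post_mean + half

def confidence_interval(sample, gamma):
    """Corresponding confidence interval: xbar -/+ z * sqrt(1/n)."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    xbar = sum(sample) / n
    half = z * sqrt(1 / n)
    return xbar - half, xbar + half

observations = [1.2, 0.8, 1.5]           # hypothetical data
bayes = bayes_interval(observations, prior_mean=0.0, gamma=0.95)
classical = confidence_interval(observations, gamma=0.95)
print(bayes)       # centred at 3.5/4 = 0.875, half-width about 0.98
print(classical)   # centred at 3.5/3, half-width about 1.13
```
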


PROBLEMS 

1 Let A" be a single observation from the density 

f(x; 6) = 6x°-'I {0 , ,,(*), 

where 8 > 0. 

(a) Find a pivotal quantity, and use it to find a confidence-interval estimator of 8. 

(b) Show that (y/2, y) is a confidence interval for 8. Find its confidence coef¬ 
ficient. Also, find a better confidence interval for 6. Define Y = — 1/log X. 

2 Let X ,,..., X„ be a random sample from N(0, 8), 8 >0. Give an example of a 
pivotal quantity, and use it to obtain a confidence-interval estimator of 8. 

3 Suppose that T, is a lOOy percent lower confidence limit for t(0) and 70 is a lOOy 
percent upper confidence limit for t(0). Further assume that P e [T, < 70] = 1. 
Find a 100(2y — 1) percent confidence interval for t(0). (Assume y > 1.) 

4 Let Xi, .... X„ denote a random sample from f(x; 0) = I (0 ^ 0+i) (x). Let 
Yi < ■ < y„ be the corresponding ordered sample. Show that ( Y u Y„) is a 
confidence interval for 8. Find its confidence coefficient. 

5 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$.

(a) Find a $100\gamma$ percent confidence interval for the mean of the population.

(b) Do the same for the variance of the population.

(c) What is the probability that these intervals cover the true mean and true
variance, simultaneously?

(d) Find a confidence-interval estimator of $e^{-\theta} = P[X > 1]$.

(e) Find a pivotal quantity based only on $Y_1$, and use it to find a confidence-
interval estimator of $\theta$. ($Y_1 = \min[X_1, \ldots, X_n]$.)

6 $X$ is a single observation from $\theta e^{-\theta x} I_{(0,\infty)}(x)$, where $\theta > 0$.

(a) $(X, 2X)$ is a confidence interval for $1/\theta$. What is its confidence coefficient?

(b) Find another confidence interval for $1/\theta$ that has the same coefficient but
smaller expected length.

7 Let $X_1, X_2$ denote a random sample of size 2 from $N(\theta, 1)$. Let $Y_1 < Y_2$ be the
corresponding ordered sample.

(a) Determine $\gamma$ in $P[Y_1 < \theta < Y_2] = \gamma$. Find the expected length of the interval
$(Y_1, Y_2)$.

(b) Find that confidence-interval estimator for $\theta$ using $\bar{X} - \theta$ as a pivotal quantity
that has a confidence coefficient $\gamma$, and compare the length with the expected
length in part (a).

8 Consider random sampling from a normal distribution with mean $\mu$ and variance
$\sigma^2$.

(a) Derive a confidence-interval estimator of $\mu$ when $\sigma^2$ is known.

(b) Derive a confidence-interval estimator of $\sigma^2$ when $\mu$ is known.

9 Find a 90 percent confidence interval for the mean of a normal distribution with
$\sigma = 3$ given the sample (3.3, -.3, -.6, -.9). What would be the confidence
interval if $\sigma$ were unknown?

10 The breaking strengths in pounds of five specimens of manila rope of diameter 
■ys inch were found to be 660, 460, 540, 580, and 550. 

(a) Estimate the mean breaking strength by a 95 percent confidence interval 
assuming normality. 

(b) Estimate the point at which only 5 percent of such specimens would be 
expected to break. 

(c) Estimate $\sigma^2$ by a 90 percent confidence interval; also $\sigma$.

(d) Plot an 81 percent confidence region for the joint estimation of $\mu$ and $\sigma^2$; for
$\mu$ and $\sigma$.

11 A sample was drawn from each of five populations assumed to be normal with the
same variance. The values of $(n - 1)S^2 = \sum (X_i - \bar{X})^2$ and $n$, the sample size,
were

$S^2$:  40  30  20  42  50

$n$:     6   4   3   7   8

Find 98 percent confidence limits for the common variance. 

12 Develop a method for estimating the ratio of variances of two normal populations 
by a confidence interval. 

13 What is the probability that the length of a $t$ confidence interval for $\mu$ when
sampling from a normal distribution will be less than $\sigma$ for samples of size 20?

14 In sampling from a normal population compare the average length of the two
confidence intervals for the mean $\mu$ when (a) $\sigma$ is known and (b) $\sigma$ is unknown.

15 Show that the length and the variance of the length of the $t$ confidence interval
for $\mu$ when sampling from a normal population approach 0 with increasing sample
size.

16 In sampling from a normal population with both $\mu$ and $\sigma$ unknown, how large a
sample must be drawn to make the probability .95 that a 90 percent confidence
interval for $\mu$ will have length less than $\sigma/5$?

17 Show that the length of the confidence interval for $\sigma$ of a normal population
approaches 0 with increasing sample size.

18 To test two promising new lines of hybrid corn under normal farming conditions, 
a seed company selected eight farms at random in Iowa and planted both lines in 
experimental plots on each farm. The yields (converted to bushels per acre) for 
the eight locations were 

Line A: 86 87 56 93 84 93 75 79 

Line B: 80 79 58 91 77 82 74 66 

Assuming that the two yields are jointly normally distributed, estimate the difference
between the mean yields by a 95 percent confidence interval.

19 $X_1, \ldots, X_n$ is a random sample from $(1/\theta)x^{(1-\theta)/\theta} I_{[0,1]}(x)$, where $\theta > 0$. Find a
$100\gamma$ percent confidence interval for $\theta$. Find its expected length. Find the
limiting expected length of your confidence interval. Find $n$ such that
$P[\text{length} < \delta\theta] > \rho$ for fixed $\delta$ and $\rho$. (You may use the central-limit theorem.)

20 Develop a method for estimating the parameter of the Poisson distribution by a 
confidence interval. 

21 Find a good $100\gamma$ percent confidence interval for $\theta$ when sampling from $f(x; \theta) =
I_{(\theta - 1/2,\ \theta + 1/2)}(x)$.

22 Find a good $100\gamma$ percent confidence interval for $\theta$ when sampling from $f(x; \theta) =
(2x/\theta^2) I_{(0,\theta)}(x)$, where $\theta > 0$.

23 One head and two tails resulted when a coin was tossed three times. Find a 
90 percent confidence interval for the probability of a head. 

24 Suppose that 175 heads and 225 tails resulted from 400 tosses of a coin. Find a 
90 percent confidence interval for the probability of a head. Find a 99 percent 
confidence interval. Does this appear to be a true coin ? 

25 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = f(x; \mu, \sigma) = \phi_{\mu,\sigma^2}(x)$. Define
$\tau(\theta)$ by $\int_{\tau(\theta)}^{\infty} \phi_{\mu,\sigma^2}(x)\,dx = \alpha$ ($\alpha$ is fixed). Recall what the UMVUE of $\tau(\theta)$ is. Find
a $100\gamma$ percent confidence interval for $\tau(\theta)$. (If you cannot find an exact $100\gamma$ percent
confidence interval, find an approximate one.)

26 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,1}(x)$. Assume that the
prior distribution of $\Theta$ is $N(\mu_0, \sigma_0^2)$, $\mu_0$ and $\sigma_0^2$ known. Find a $100\gamma$ percent Bayesian
interval estimator of $\theta$, and compare it with the corresponding confidence interval.

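27 Let $X_1, \ldots, X_n$ be a random sample from $f(x \mid \theta) = \theta x^{\theta - 1} I_{(0,1)}(x)$, where $\theta > 0$.
Assume that the prior distribution of $\Theta$ is given by

$$g_\Theta(\theta) = \frac{1}{\Gamma(r)} \lambda^r \theta^{r-1} e^{-\lambda\theta} I_{(0,\infty)}(\theta),$$

where $r$ and $\lambda$ are known. Find a 95 percent Bayesian interval estimator of $\theta$.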

*28 Let $X$ denote the life in hours of a radioactive particle. Suppose $X$ has a density

$$f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x).$$

A random sample of $n$ particles is put under observation, but the experiment is to
stop when the $k$th particle has expired; i.e., it is intended not to wait until all the
particles have ceased activity but only until $k$ of them ($k$ fixed in advance) have
done so. The data consist of the $k$ measurements $Y_1, \ldots, Y_k$ and $n - k$ measurements
known only to exceed $Y_k$, where $Y_i$ is the lifetime of the $i$th particle to
expire. Find the maximum-likelihood estimator of the mean lifetime $1/\theta$. Also
find a confidence-interval estimator of $1/\theta$.


TESTS OF HYPOTHESES


1 INTRODUCTION AND SUMMARY 

There are two major areas of statistical inference: the estimation of parameters 
and the testing of hypotheses. We shall study the second of these two areas in 
this chapter. Our aim will be to develop general methods for testing hypotheses 
and to apply those methods to some common problems. The methods developed 
will be of further use in later chapters. 

In experimental research, the object is sometimes merely to estimate parameters.
Thus one may wish to estimate the yield of a new hybrid line of corn.
But more often the ultimate purpose will involve some use of the estimate. 
One may wish, for example, to compare the yield of the new line with that of a 
standard line and perhaps recommend that the new line replace the standard 
line if it appears superior. This is a common situation in research. One may 
wish to determine whether a new method of sealing light bulbs will increase 
the life of the bulbs, whether a new germicide is more effective in treating a 
certain infection than a standard germicide, whether one method of preserving 
foods is better than another insofar as retention of vitamins is concerned,
and so on.

Using the light-bulb example as an illustration, let us suppose that the
average life of bulbs made under a standard manufacturing procedure is 1400
hours. It is desired to test a new procedure for manufacturing the bulbs. The
statistical model here is this: We are dealing with two populations of light bulbs:
those made by the standard process and those made by the proposed process.
We know (from numerous past investigations) that the mean of the first population
is about 1400. The question is whether the mean of the second population
is greater than or less than 1400. Traditionally, to answer this type of question,
we set up the hypothesis that one mean is greater than the other mean. Then,
on the basis of a sample from the population of the proposed process we shall
either accept or reject the hypothesis.

For our example, we formulate the hypothesis that the proposed process 
is no better than the standard process. Generally we hope that the hypothesis 
will be rejected. To test the hypothesis, a number of bulbs are made by the new 
process and their lives measured. Suppose that the mean of this sample of 
observations is 1550 hours. The indication is that the new process is better, but 
suppose that the estimate of the standard deviation of the mean $\hat{\sigma}/\sqrt{n}$ is 125
($n$ being the sample size). Then a 95 percent confidence interval for the mean of
the second population (assuming normality) is roughly 1300 to 1800 hours. The
sample mean 1550 could very easily have come from a population with mean 1400.
We have no strong grounds for rejecting the hypothesis. If, on the other hand,
$\hat{\sigma}/\sqrt{n}$ were 25, then we could very confidently reject the hypothesis and pronounce
the proposed manufacturing process to be superior. 
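
The arithmetic behind the two cases can be reproduced with a short Python sketch. It assumes the usual normal-theory interval (sample mean plus or minus a normal quantile times the estimated standard deviation of the mean); the function name `approx_interval` is hypothetical.

```python
from statistics import NormalDist


def approx_interval(sample_mean, std_error, level=0.95):
    """Normal-theory interval: sample_mean +/- z * std_error."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    return sample_mean - z * std_error, sample_mean + z * std_error


wide = approx_interval(1550, 125)    # roughly (1305, 1795)
narrow = approx_interval(1550, 25)   # roughly (1501, 1599)

# Does each interval contain the standard-process mean of 1400?
covers_1400_wide = wide[0] <= 1400 <= wide[1]
covers_1400_narrow = narrow[0] <= 1400 <= narrow[1]
```

With a standard deviation of the mean of 125 the interval contains 1400, so the data give no strong grounds for rejection; with 25 the interval lies well above 1400.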

The testing of hypotheses is seen to be closely related to the problem of 
estimation. It will be instructive, however, to develop the theory of testing 
independently of the theory of estimation, at least in the beginning. 

In order to conveniently talk about testing of hypotheses, we need to introduce
some language and notation and give some definitions. As was the case
when we studied estimation, we will assume that we can obtain a random sample
$X_1, \ldots, X_n$ from some density $f(\cdot\,; \theta)$. A statistical hypothesis will be a hypothesis
about the distribution of the population.

Definition 1 Statistical hypothesis A statistical hypothesis is an assertion
or conjecture about the distribution of one or more random variables.
If the statistical hypothesis completely specifies the distribution, then it is
called simple; otherwise, it is called composite. ////

Notation To denote a statistical hypothesis, we will use a script capital
$\mathscr{H}$ followed by a colon that in turn is followed by the assertion that
specifies the hypothesis. ////

EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
The statistical hypothesis that the mean of this normal population is less
than or equal to 17 is denoted as follows: $\mathscr{H}$: $\theta \le 17$. Such a hypothesis
is composite; it does not completely specify the distribution. On the
other hand, the hypothesis $\mathscr{H}$: $\theta = 17$ is simple since it completely specifies
the distribution. ////

Definition 2 Test of a statistical hypothesis A test of a statistical
hypothesis $\mathscr{H}$ is a rule or procedure for deciding whether to reject $\mathscr{H}$. ////

Notation Let us use a capital upsilon $\Upsilon$ to denote a test. ////


EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}$: $\theta \le 17$. One possible test $\Upsilon$ is as follows: Reject $\mathscr{H}$ if and
only if $\bar{X} > 17 + 5/\sqrt{n}$. ////

A test can be either randomized or nonrandomized. The test $\Upsilon$ given in
Example 2 above is an example of a nonrandomized test. Another possible
test, say $\Upsilon'$, of $\mathscr{H}$ in Example 2 is the following: Toss a coin, and reject $\mathscr{H}$ if
and only if a head appears. Such is an example of a randomized test. Although
we will make little use of randomized tests in this book, we do include their
definition. Definitions of both nonrandomized and randomized tests follow.

Notation As in previous chapters, we let $\mathfrak{X}$ denote the sample space of
observations, or the potential data set; that is, $\mathfrak{X} = \{(x_1, \ldots, x_n): (x_1, \ldots, x_n)$
is a possible value of $(X_1, \ldots, X_n)\}$. ////

Definition 3 Nonrandomized test and critical region Let a test $\Upsilon$ of a
statistical hypothesis $\mathscr{H}$ be defined as follows: Reject $\mathscr{H}$ if and only if
$(x_1, \ldots, x_n) \in C_\Upsilon$, where $C_\Upsilon$ is a subset of $\mathfrak{X}$; then $\Upsilon$ is called a nonrandomized
test, and $C_\Upsilon$ is called the critical region of the test $\Upsilon$. ////


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
$\mathfrak{X}$ is euclidean $n$ space. Consider $\mathscr{H}$: $\theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}$ if
and only if $\bar{x} > 17 + 5/\sqrt{n}$. Then $\Upsilon$ is nonrandomized, and $C_\Upsilon =
\{(x_1, \ldots, x_n): \bar{x} > 17 + 5/\sqrt{n}\}$. ////

Remark A nonrandomized test $\Upsilon$ of $\mathscr{H}$ is a decomposition of $\mathfrak{X}$ into $C_\Upsilon$
and $\bar{C}_\Upsilon$ such that if $(x_1, \ldots, x_n) \in C_\Upsilon$, $\mathscr{H}$ is rejected. So a nonrandomized
test is specified by its corresponding critical region. ////
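
Since a nonrandomized test is nothing more than membership in a critical region, it can be written as a single predicate. The following Python sketch encodes the critical region of Example 3; the function name and the two data sets are hypothetical illustrations.

```python
from math import sqrt


def in_critical_region(xs, theta0=17.0, sigma=5.0):
    """True iff the sample lies in C = {x : mean(x) > theta0 + sigma/sqrt(n)}."""
    n = len(xs)
    return sum(xs) / n > theta0 + sigma / sqrt(n)


# Hypothetical samples of size 4, so the cutoff is 17 + 5/2 = 19.5.
sample_a = [19.0, 21.5, 18.2, 20.3]   # mean 19.75: in the critical region
sample_b = [17.0, 18.0, 16.5, 17.5]   # mean 17.25: not in the critical region

reject_a = in_critical_region(sample_a)
reject_b = in_critical_region(sample_b)
```

The complement of the critical region is simply the set of samples for which the predicate is false.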

Definition 4 Randomized test A test $\Upsilon$ of a hypothesis $\mathscr{H}$ is defined to
be a randomized test if $\Upsilon$ is defined by the function $\psi_\Upsilon(x_1, \ldots, x_n) =
P[\mathscr{H} \text{ is rejected} \mid (x_1, \ldots, x_n) \text{ is observed}]$. The function $\psi_\Upsilon(\cdot, \ldots, \cdot)$ is
called the critical function of the test $\Upsilon$. ////

The actual performing of a nonrandomized test $\Upsilon$ of $\mathscr{H}$ is straightforward;
one observes a random sample, say $x_1, \ldots, x_n$, checks to see whether the observed
sample falls in the critical region, and rejects $\mathscr{H}$ when it does. On the other
hand, to perform a randomized test $\Upsilon$ of $\mathscr{H}$, one first observes the random sample,
say $x_1, \ldots, x_n$, then evaluates $\psi_\Upsilon(x_1, \ldots, x_n)$, and finally observes the result of
some auxiliary Bernoulli trial that has $\psi_\Upsilon(x_1, \ldots, x_n)$ as its probability of success,
and if the Bernoulli trial results in a success, then $\mathscr{H}$ is rejected. Since the
performance of the auxiliary Bernoulli trial is extraneous to the actual testing
problem, that is, it does not depend on the data $x_1, \ldots, x_n$ of the experiment, one
might reasonably wonder why its result should be the deciding factor in accepting
or rejecting the hypothesis. It is for this reason that randomized tests are not
often employed in practice; when they are, usually the sample space $\mathfrak{X}$ is decomposed
into three sets, one where the hypothesis is accepted, another where the
hypothesis is rejected, and the third where "randomization" takes place. This
third region is often the boundary between the acceptance and rejection region
and/or a region where it is not easy to decide whether to accept or reject. The
following example may help in understanding randomized tests.


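EXAMPLE 4 Let $X_1, \ldots, X_{10}$ be a random sample of size 10 from $f(x; \theta) =
\theta^x (1 - \theta)^{1-x}$ for $x = 0$ or 1. Suppose we want to test the hypothesis
$\mathscr{H}$: $\theta \le 1/2$. A possible test $\Upsilon$ is the following: Reject $\mathscr{H}$ if $\sum_{1}^{10} x_i > 5$, accept
$\mathscr{H}$ if $\sum_{1}^{10} x_i < 5$, and decide between rejecting and accepting by tossing a
fair coin if $\sum_{1}^{10} x_i = 5$. (The tossing of the coin is the auxiliary Bernoulli
trial.) Such a test $\Upsilon$ partitions $\mathfrak{X}$ into three regions, say $A$, $B$, and $C$,
where

$$A = \Big\{(x_1, \ldots, x_{10}): \sum x_i < 5\Big\}, \qquad B = \Big\{(x_1, \ldots, x_{10}): \sum x_i = 5\Big\},$$

and

$$C = \Big\{(x_1, \ldots, x_{10}): \sum x_i > 5\Big\}.$$

The critical function of test $\Upsilon$ is given by

$$\psi_\Upsilon(x_1, \ldots, x_{10}) = \begin{cases} 1 & \text{if } (x_1, \ldots, x_{10}) \in C \\ 1/2 & \text{if } (x_1, \ldots, x_{10}) \in B \\ 0 & \text{if } (x_1, \ldots, x_{10}) \in A. \end{cases}$$

////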
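
The two stages of a randomized test, evaluating the critical function and then performing the auxiliary Bernoulli trial, can be sketched in Python as follows. The function names are hypothetical, and the standard library's pseudorandom generator stands in for the fair coin.

```python
import random


def critical_function(xs):
    """psi(x_1, ..., x_10): probability of rejecting given the observed sample."""
    total = sum(xs)
    if total > 5:
        return 1.0      # region C: always reject
    if total == 5:
        return 0.5      # region B: toss a fair coin
    return 0.0          # region A: never reject


def perform_test(xs, rng):
    """Run the auxiliary Bernoulli trial; return True when the hypothesis is rejected."""
    # rng.random() lies in [0, 1), so psi = 1 always rejects and psi = 0 never does.
    return rng.random() < critical_function(xs)


rng = random.Random(0)                      # seeded only for reproducibility
boundary_sample = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rejections = sum(perform_test(boundary_sample, rng) for _ in range(10_000))
```

On the boundary sample roughly half of the repeated trials reject, which is what the value 1/2 of the critical function on region $B$ prescribes.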


Remark We saw that a nonrandomized test was specified by its critical
region. Likewise, a randomized test is specified by its critical function.
In fact any function $\psi(\cdot, \ldots, \cdot)$ with domain $\mathfrak{X}$ and counterdomain the
interval [0, 1] is a possible critical function and defines a randomized
test. ////


The following remark shows that a nonrandomized test is a particular case
of a randomized test.


Remark If test $\Upsilon$ has a critical function defined by

$$\psi_\Upsilon(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } (x_1, \ldots, x_n) \in C_\Upsilon \\ 0 & \text{otherwise} \end{cases} \;=\; I_{C_\Upsilon}(x_1, \ldots, x_n),$$

then $\Upsilon$ is a nonrandomized test with a critical region $C_\Upsilon$. ////


As we mentioned earlier, we will not make extensive use of randomized 
tests. Theorem 1 below requires their use; other than that, their only use will 
be in obtaining tests of exact size (see Definition 7), and then only for sampling 
from discrete distributions. 

In many hypotheses-testing problems two hypotheses are discussed: The
first, the hypothesis being tested, is called the null hypothesis, denoted by $\mathscr{H}_0$,
and the second is called the alternative hypothesis, denoted by $\mathscr{H}_1$. The thinking
is that if the null hypothesis is false, then the alternative hypothesis is true,
and vice versa. We often say that $\mathscr{H}_0$ is tested against, or versus, $\mathscr{H}_1$. If the
null hypothesis $\mathscr{H}_0$ is not rejected, we say that $\mathscr{H}_0$ is accepted. With this kind
of thinking, two types of errors can be made.


Definition 5 Types of error and size of error Rejection of $\mathscr{H}_0$ when it is
true is called a Type I error, and acceptance of $\mathscr{H}_0$ when it is false is called
a Type II error. The size of a Type I error is defined to be the probability
that a Type I error is made, and similarly the size of a Type II error is the
probability that a Type II error is made. ////

If the distribution from which the sample was obtained is parameterized by
$\theta$, where $\theta \in \Theta$, then associated with any test is a power function, defined as in
Definition 6.

Definition 6 Power function Let $\Upsilon$ be a test of the null hypothesis $\mathscr{H}_0$.
The power function of the test $\Upsilon$, denoted by $\pi_\Upsilon(\theta)$, is defined to be the
probability that $\mathscr{H}_0$ is rejected when the distribution from which the
sample was obtained was parameterized by $\theta$. ////

The power function will play the same role in hypothesis testing that mean- 
squared error played in estimation. It will usually be our standard in assessing 
the goodness of a test or in comparing two competing tests. An ideal power 
function, of course, is a function that is 0 for those $\theta$ corresponding to the null
hypothesis and is unity for those $\theta$ corresponding to the alternative hypothesis.
The idea is that you do not want to reject $\mathscr{H}_0$ if $\mathscr{H}_0$ is true and you do want to
reject $\mathscr{H}_0$ when $\mathscr{H}_0$ is false.

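Remark $\pi_\Upsilon(\theta) = P_\theta[\text{reject } \mathscr{H}_0]$, where $\theta$ is the true value of the parameter.
If $\Upsilon$ is a nonrandomized test, then $\pi_\Upsilon(\theta) = P_\theta[(X_1, \ldots, X_n) \in C_\Upsilon]$,
where $C_\Upsilon$ is the critical region associated with test $\Upsilon$. If $\Upsilon$ is a randomized
test with critical function $\psi_\Upsilon$, then

$$\begin{aligned}
\pi_\Upsilon(\theta) &= P_\theta[\text{reject } \mathscr{H}_0] \\
&= \int \cdots \int P[\text{reject } \mathscr{H}_0 \mid x_1, \ldots, x_n]\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) \prod_{i=1}^{n} dx_i \\
&= \int \cdots \int \psi_\Upsilon(x_1, \ldots, x_n)\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) \prod_{i=1}^{n} dx_i \\
&= \mathscr{E}_\theta[\psi_\Upsilon(X_1, \ldots, X_n)].
\end{aligned}$$

The argument is similar for discrete random variables. ////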


EXAMPLE 5 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}_0$: $\theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}_0$ if and only if $\bar{X} > 17 + 5/\sqrt{n}$.

$$\pi_\Upsilon(\theta) = P_\theta\left[\bar{X} > 17 + \frac{5}{\sqrt{n}}\right] = 1 - \Phi\left(\frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right).$$

For $n = 25$, $\pi_\Upsilon(\theta)$ is sketched in Fig. 1.

The power function is useful in telling how good a particular test is.
In this example, if $\theta$ is greater than about 20, the test $\Upsilon$ is almost certain
to reject $\mathscr{H}_0$, as it should. And if $\theta$ is less than about 16, the test $\Upsilon$ is
almost certain not to reject $\mathscr{H}_0$, as it should. On the other hand, if
$17 < \theta < 18$ (so $\mathscr{H}_0$ is false), the test $\Upsilon$ has less than half a chance of
rejecting $\mathscr{H}_0$. ////
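
The values read off the sketch of the power function can be confirmed with a few lines of Python. This is a small numerical sketch of the power function of Example 5 for $n = 25$; the function name `power` and the chosen grid of $\theta$ values are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power(theta, n=25, theta0=17.0, sigma=5.0):
    """Power of the test 'reject iff mean > theta0 + sigma/sqrt(n)' at theta."""
    se = sigma / sqrt(n)                 # standard deviation of the sample mean
    cutoff = theta0 + se
    return 1 - NormalDist().cdf((cutoff - theta) / se)


# Power at a few parameter values; for n = 25 the cutoff is 18.
table = {theta: round(power(theta), 3) for theta in (16, 17, 17.5, 18, 20)}
```

For $n = 25$ the power is about .023 at $\theta = 16$, about .159 at $\theta = 17$, exactly .5 at $\theta = 18$, and about .977 at $\theta = 20$, in line with the remarks in the example.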

Definition 7 Size of test Let $\Upsilon$ be a test of the hypothesis $\mathscr{H}_0$: $\theta \in \Theta_0$,
where $\Theta_0 \subset \Theta$; that is, $\Theta_0$ is a subset of the parameter space $\Theta$. The size
of the test $\Upsilon$ of $\mathscr{H}_0$ is defined to be $\sup_{\theta \in \Theta_0} [\pi_\Upsilon(\theta)]$. The size of the test for
a nonrandomized test is also referred to as the size of the critical region. ////

Remark Many writers use the terms "significance level" and "size of
test" interchangeably. We, however, will avoid use of the term "significance
level," intending to reserve its use for tests of significance, a type of
statistical inference that is closely related to hypothesis testing. Tests of
significance will not be considered in this book; the interested reader is
referred to Ref. [37]. ////

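EXAMPLE 6 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}_0$: $\theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}_0$ if $\bar{X} > 17 + 5/\sqrt{n}$.
$\Theta_0 = \{\theta: \theta \le 17\}$, and the size of the test $\Upsilon$ is

$$\sup_{\theta \in \Theta_0} [\pi_\Upsilon(\theta)] = \sup_{\theta \le 17} \left[1 - \Phi\left(\frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right)\right] = 1 - \Phi(1) \approx .159,$$

since the power function is increasing in $\theta$ and so attains its supremum over
$\Theta_0$ at $\theta = 17$. ////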
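
The supremum in Definition 7 can also be approximated by brute force. The sketch below evaluates the power function of the test in Example 6 on a grid of null values $\theta \le 17$ and takes the maximum; the grid limits and step are arbitrary assumptions, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power(theta, n, theta0=17.0, sigma=5.0):
    """Power of the test 'reject iff mean > theta0 + sigma/sqrt(n)' at theta."""
    se = sigma / sqrt(n)
    return 1 - NormalDist().cdf((theta0 + se - theta) / se)


def size_on_grid(n, lowest=0.0, step=0.01):
    """Approximate the sup of the power over theta <= 17 by a grid search."""
    count = int(round((17.0 - lowest) / step))
    thetas = [17.0 - i * step for i in range(count + 1)]   # includes theta = 17
    return max(power(t, n) for t in thetas)


exact_size = 1 - NormalDist().cdf(1.0)     # power at the boundary value theta = 17
grid_size_25 = size_on_grid(25)
grid_size_9 = size_on_grid(9)
```

Because the power function is increasing in $\theta$, the grid maximum is attained at the boundary $\theta = 17$, and the result does not depend on $n$ for this particular cutoff.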

In our study of point estimation, we found that for certain considerations
we could restrict attention to estimators that were functions of sufficient statistics
only. The same is true for testing hypotheses when the power function is used
as a basis of comparing tests, as the following theorem shows.

Theorem 1 If $X_1, \ldots, X_n$ is a random sample from $f(x; \theta)$, where $\theta \in \Theta$,
and $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_r = s_r(X_1, \ldots, X_n)$ is a set of sufficient statistics,
then for any test $\Upsilon$ with critical function $\psi_\Upsilon$, there exists a test, say
$\Upsilon'$, and corresponding critical function, say $\psi_{\Upsilon'}$, depending only on the set
of sufficient statistics which satisfies $\pi_\Upsilon(\theta) = \pi_{\Upsilon'}(\theta)$ for all $\theta \in \Theta$.

proof Define $\psi_{\Upsilon'}(s_1, \ldots, s_r) = \mathscr{E}[\psi_\Upsilon(X_1, \ldots, X_n) \mid S_1 = s_1, \ldots,
S_r = s_r]$; by sufficiency this conditional expectation does not depend on $\theta$, and
so $\psi_{\Upsilon'}$ is a critical function. Furthermore, $\pi_{\Upsilon'}(\theta) =
\mathscr{E}_\theta[\psi_{\Upsilon'}(S_1, \ldots, S_r)] = \mathscr{E}_\theta[\mathscr{E}[\psi_\Upsilon(X_1, \ldots, X_n) \mid S_1, \ldots, S_r]] = \mathscr{E}_\theta[\psi_\Upsilon(X_1, \ldots,
X_n)] = \pi_\Upsilon(\theta)$. ////
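
The construction in the proof can be illustrated on a small discrete case. The Python sketch below is an assumed example, not one taken from the text: it uses Bernoulli sampling with $n = 4$, for which $\sum X_i$ is sufficient, starts from the test that rejects exactly when the first observation is 1, and builds the new critical function by conditioning on the sum. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

N = 4  # sample size for this illustration


def psi(xs):
    """Original critical function: reject iff the first observation equals 1."""
    return Fraction(xs[0])


def psi_prime(s):
    """E[psi(X) | sum X = s]; given the sum, all arrangements are equally likely."""
    matching = [xs for xs in product((0, 1), repeat=N) if sum(xs) == s]
    return sum(psi(xs) for xs in matching) / len(matching)


def power(crit, theta):
    """Exact power E_theta[crit(X_1, ..., X_N)] by enumerating the sample space."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=N):
        s = sum(xs)
        total += crit(xs) * theta**s * (1 - theta) ** (N - s)
    return total


theta = Fraction(3, 10)
power_original = power(psi, theta)
power_sufficient = power(lambda xs: psi_prime(sum(xs)), theta)
```

Here the new critical function works out to $s/4$, and both power functions equal $\theta$ for every $\theta$, as the theorem asserts.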

The theorem shows that given any test, another test which depends only on 
a set of sufficient statistics can be found, and this new test has a power function 
identical to the power function of the original test. So, in our search for good 
tests we need only look among tests that depend on sufficient statistics. 

We have introduced some of the language of testing in the above. The 
problem of testing is like estimation in the sense that it is twofold: First, a 
method of finding a test is needed, and, second, some criteria for comparing 
competing tests are desirable. Although we will be interested in both aspects 
of the problem, we will not discuss them in that order. First we will consider, 
in Sec. 2, the problem of testing a simple null hypothesis against a simple alternative.
Two approaches will be assumed. The first will use the power function
as a basis for setting goodness criteria for tests, and the second will use a loss 
function. The Neyman-Pearson lemma is stated and proved. It will turn out 
that all those tests, which are best in some sense, will be of the form of a simple 
likelihood-ratio, which is defined. 

Tests of composite hypotheses will be discussed in Sec. 3. The section
will commence, in Subsec. 3.1, with a discussion of the generalized likelihood-
ratio principle and the generalized likelihood-ratio test. This principle plays a
central role in testing, just as maximum likelihood played a central role in estimation.
It is a technique for arriving at a test that in general will be a good
test, just as maximum likelihood led to an estimator that in general was quite a
good estimator. For a book of the level of this book, it is probably the most
important concept in testing. The notion of uniformly most powerful tests
will be introduced in Subsec. 3.2, and several methods that are sometimes useful
in finding such tests will be presented. Unbiasedness and invariance in estimation
are two methods of restricting the class of estimators with the hope
of finding a best estimator within the restricted class. These two concepts play
essentially the same role in testing; they are methods of restricting the totality of
possible tests with the hope of finding a best test within the restricted class. We
will discuss only unbiasedness, and it only briefly in Subsec. 3.3. Subsection
3.4 will summarize several methods of finding tests of composite hypotheses.

Section 4 will be devoted to consideration of various hypotheses and 
tests that arise in sampling from a normal distribution. Section 5 will consider 
tests that fall within a category of tests generally labeled chi-square tests. 
Included will be the asymptotic distribution of the generalized likelihood-ratio, 
goodness-of-fit tests, tests of the equality of two or more distributions, and tests 
of independence in contingency tables. Section 6 will give the promised discussion
of the connection between tests of hypotheses and interval estimation.
The chapter will end with an introduction to sequential tests of hypotheses in 
Sec. 7. 

The reader will note that our discussion of tests of hypotheses is not as 
thorough as that of estimation. Both testing and estimation will be used in 
later chapters, especially in Chap. X. Also, a number of the nonparametric 
techniques that will be presented in Chap. XI will be tests of hypotheses. 

We stated at the beginning of this section that testing of hypotheses is one 
major area of statistical inference. A type of statistical inference that is closely 
related (in fact so closely related that many writers do not make a distinction) to 
hypothesis testing is that of significance testing. The concept of significance 
testing has important use in applied problems; however, we will not consider it 
in this book. The interested reader is referred to Ref. [37]. 


2 SIMPLE HYPOTHESIS VERSUS 
SIMPLE ALTERNATIVE 

2.1 Introduction 

In this section we consider testing a simple null hypothesis against a simple 
alternative hypothesis. This case is actually not very useful in applied statistics, 
but it will serve the purpose of introducing us to the theory of testing hypotheses. 

We assume that we have a sample that came from one of two completely
specified distributions. Our object is to determine which one. More precisely,
assume that a random sample $X_1, \ldots, X_n$ came from the density $f_0(x)$ or $f_1(x)$
and we want to test $\mathscr{H}_0$: $X_i$ distributed as $f_0(\cdot)$, abbreviated $X_i \sim f_0(\cdot)$, versus
$\mathscr{H}_1$: $X_i \sim f_1(\cdot)$. If we had only one observation $x_1$ and $f_0(\cdot)$ and $f_1(\cdot)$ were
as in Fig. 2, one might quite rationally decide that the observation came from
$f_0(\cdot)$ if $f_0(x_1) > f_1(x_1)$ and, conversely, decide that the observation came from
$f_1(\cdot)$ if $f_1(x_1) > f_0(x_1)$. This simple intuitive method of obtaining a test can be
expanded into a family of tests that, as we shall see, will contain some good tests.

Definition 8 Simple likelihood-ratio test Let $X_1, \ldots, X_n$ be a random
sample from either $f_0(\cdot)$ or $f_1(\cdot)$. A test $\Upsilon$ of $\mathscr{H}_0$: $X_i \sim f_0(\cdot)$ versus
$\mathscr{H}_1$: $X_i \sim f_1(\cdot)$ is defined to be a simple likelihood-ratio test if $\Upsilon$ is defined
by

Reject $\mathscr{H}_0$ if $\lambda < k$,

Accept $\mathscr{H}_0$ if $\lambda > k$,

Either accept $\mathscr{H}_0$, reject $\mathscr{H}_0$, or randomize if $\lambda = k$, (1)

where

$$\lambda = \lambda(x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} f_0(x_i)}{\prod_{i=1}^{n} f_1(x_i)} = \frac{L_0(x_1, \ldots, x_n)}{L_1(x_1, \ldots, x_n)} = \frac{L_0}{L_1}$$

and $k$ is a nonnegative constant. [$L_j = L_j(x_1, \ldots, x_n)$ is the likelihood
function for sampling from the density $f_j(\cdot)$.] ////

For each different $k$ we have a different test. For a fixed $k$ the test says to
reject $\mathscr{H}_0$ if the ratio of likelihoods is small; that is, reject $\mathscr{H}_0$ if it is more likely
($L_1$ is large compared to $L_0$) that the sample came from $f_1(\cdot)$ than from $f_0(\cdot)$.
Such a test certainly has intuitive appeal. In fact, one might suspect that an
optimum test will have to be of the form of a simple likelihood-ratio test.
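
A simple likelihood-ratio test is easy to state in code. The following Python sketch assumes, purely for illustration, that $f_0$ is the normal density with mean 0 and variance 1 and $f_1$ is the normal density with mean 1 and variance 1; these densities, the samples, and the choice $k = 1$ are hypothetical and not taken from the text.

```python
from math import prod
from statistics import NormalDist


def likelihood_ratio(xs, f0, f1):
    """lambda = L_0 / L_1 = prod f0(x_i) / prod f1(x_i)."""
    return prod(f0(x) for x in xs) / prod(f1(x) for x in xs)


def simple_lr_test(xs, f0, f1, k):
    """Apply rule (1): reject if lambda < k, accept if lambda > k."""
    lam = likelihood_ratio(xs, f0, f1)
    if lam < k:
        return "reject"
    if lam > k:
        return "accept"
    return "accept, reject, or randomize"


f0 = NormalDist(0, 1).pdf     # assumed null density
f1 = NormalDist(1, 1).pdf     # assumed alternative density

near_one = [1.4, 0.9, 1.8]    # hypothetical data centered near 1
near_zero = [-0.3, 0.2, 0.1]  # hypothetical data centered near 0

decision_one = simple_lr_test(near_one, f0, f1, k=1.0)
decision_zero = simple_lr_test(near_zero, f0, f1, k=1.0)
```

For these two densities the ratio reduces to $\exp(n/2 - \sum x_i)$, so small values of $\lambda$ correspond to large sample sums, which is why the first sample leads to rejection and the second does not.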

Optimality of a test of a simple hypothesis versus a simple alternative can 
be approached in two ways. One way, using the power of the test to set goodness 
criteria, is discussed in Subsec. 2.2, and the other way, using a loss function and a 
decision-theoretical approach, is considered in Subsec. 2.3. 


2.2 Most Powerful Test 

Let $X_1, \ldots, X_n$ be a random sample from the density $f_0(\cdot)$ or the density $f_1(\cdot)$.
Let us write $f_0(x) = f(x; \theta_0)$ and $f_1(x) = f(x; \theta_1)$; then $X_1, \ldots, X_n$ is a random
sample from one or the other member of the parametric family $\{f(x; \theta): \theta = \theta_0$
or $\theta = \theta_1\}$. $\Theta = \{\theta_0, \theta_1\}$ is a parameter space with only two points in it. $\theta_0$ and
$\theta_1$ are known. We want to test $\mathscr{H}_0$: $\theta = \theta_0$ versus $\mathscr{H}_1$: $\theta = \theta_1$. Corresponding
to any test $\Upsilon$ of $\mathscr{H}_0$ versus $\mathscr{H}_1$ is its power function $\pi_\Upsilon(\theta)$. A good test is a test
for which $\pi_\Upsilon(\theta_0) = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}]$ is small (ideally 0) and $\pi_\Upsilon(\theta_1) =
P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}]$ is large (ideally unity). One might reasonably use the
two values $\pi_\Upsilon(\theta_0)$ and $\pi_\Upsilon(\theta_1)$ to set up criteria for defining a best test. $\pi_\Upsilon(\theta_0) =$
size of Type I error, and $1 - \pi_\Upsilon(\theta_1) = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}] =$ size of Type II
error; so our goodness criterion might concern making the two error sizes small.
For example, one might define as best that test which has the smallest sum of the
error sizes. Another method of defining a best test, made precise in the following
definition, is to fix the size of the Type I error and to minimize the size of the
Type II error.

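Definition 9 Most powerful test A test $\Upsilon^*$ of $\mathscr{H}_0$: $\theta = \theta_0$ versus
$\mathscr{H}_1$: $\theta = \theta_1$ is defined to be a most powerful test of size $\alpha$ ($0 < \alpha < 1$) if and
only if:

(i) $\pi_{\Upsilon^*}(\theta_0) = \alpha$. (2)

(ii) $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ for any other test $\Upsilon$ for which $\pi_\Upsilon(\theta_0) \le \alpha$. (3)

////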


A test $\Upsilon^*$ is most powerful of size $\alpha$ if it has size $\alpha$ and if among all other
tests of size $\alpha$ or less it has the largest power. Or a test $\Upsilon^*$ is most powerful of
size $\alpha$ if it has the size of its Type I error equal to $\alpha$ and has smallest size Type II
error among all other tests with size of Type I error $\alpha$ or less.

The justification for fixing the size of the Type I error to be $\alpha$ (usually
small and often taken as .05 or .01) seems to arise from those testing situations
where the two hypotheses are formulated in such a way that one type of error is
more serious than the other. The hypotheses are stated so that the Type I error
is the more serious, and hence one wants to be certain that it is small.

The following theorem is useful in finding a most powerful test of size $\alpha$.
The statement of the theorem as given here, as well as the proof, considers only
nonrandomized tests. We might note that the statement and proof of the
theorem can be altered to include all randomized tests.

Theorem 2 Neyman-Pearson lemma Let $X_1, \ldots, X_n$ be a random
sample from $f(x; \theta)$, where $\theta$ is one of the two known values $\theta_0$ or $\theta_1$,
and let $0 < \alpha < 1$ be fixed.

Let $k^*$ be a positive constant and $C^*$ be a subset of $\mathfrak{X}$ which satisfy:

(i) $P_{\theta_0}[(X_1, \ldots, X_n) \in C^*] = \alpha$. (4)

(ii) $\lambda = \dfrac{L(\theta_0; x_1, \ldots, x_n)}{L(\theta_1; x_1, \ldots, x_n)} = \dfrac{L_0}{L_1} \le k^*$ if $(x_1, \ldots, x_n) \in C^*$ (5)

and $\lambda \ge k^*$ if $(x_1, \ldots, x_n) \in \bar{C}^*$.

Then, the test $\Upsilon^*$ corresponding to the critical region $C^*$ is a most powerful
test of size $\alpha$ of $\mathscr{H}_0$: $\theta = \theta_0$ versus $\mathscr{H}_1$: $\theta = \theta_1$. [Recall that $L_j =
L(\theta_j; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta_j)$ for $j = 0$ or 1 and $\bar{C}^*$ is the complement of
$C^*$; that is, $\bar{C}^* = \mathfrak{X} - C^*$.]

proof Suppose that $k^*$ and $C^*$ satisfying conditions (i) and (ii)
exist. If there is no other test of size $\alpha$ or less, then $\Upsilon^*$ is automatically
most powerful. Let $\Upsilon$ be another test of size $\alpha$ or less, and let $C$ be its
corresponding critical region. We have $P_{\theta_0}[(X_1, \ldots, X_n) \in C] \le \alpha$. We
must show that $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ to complete the proof. {For any subset
$R$ of $\mathfrak{X}$, let us abbreviate $\int \cdots \int_R \left[\prod_{i=1}^{n} f(x_i; \theta_j)\right] \prod_{i=1}^{n} dx_i$ as $\int_R L_j$ for $j = 0, 1$.
Our notation indicates that $f_0(\cdot)$ and $f_1(\cdot)$ are probability density functions.
The same proof holds for discrete density functions.} Showing
that $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ is equivalent to showing that $\int_{C^*} L_1 \ge \int_C L_1$. See
Fig. 3.

Now $\int_{C^*} L_1 - \int_C L_1 = \int_{C^*\bar{C}} L_1 - \int_{C\bar{C}^*} L_1 \ge (1/k^*) \int_{C^*\bar{C}} L_0 - (1/k^*) \int_{C\bar{C}^*} L_0$
since $L_1 \ge L_0/k^*$ on $C^*$ (hence also on $C^*\bar{C}$) and $L_1 \le L_0/k^*$, or $-L_1 \ge
-L_0/k^*$, on $\bar{C}^*$ (hence also on $C\bar{C}^*$). But $(1/k^*)\left(\int_{C^*\bar{C}} L_0 - \int_{C\bar{C}^*} L_0\right)
= (1/k^*)\left(\int_{C^*\bar{C}} L_0 + \int_{C^*C} L_0 - \int_{C^*C} L_0 - \int_{C\bar{C}^*} L_0\right) = (1/k^*)\left(\int_{C^*} L_0 - \int_C L_0\right) =
(1/k^*)(\alpha - \text{size of test } \Upsilon) \ge 0$; so $\int_{C^*} L_1 - \int_C L_1 \ge 0$, as was to be shown.

////
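
Since the proof also holds for discrete density functions, the lemma can be checked exhaustively on a tiny example. The Python sketch below uses a single observation on five points with two made-up probability functions (an assumption for illustration only), forms the region where the likelihood ratio is at most $k^*$, and compares its power with that of every other critical region of size $\alpha$ or less.

```python
from fractions import Fraction as F
from itertools import combinations

points = range(5)
# Hypothetical discrete densities on {0, 1, 2, 3, 4}.
f0 = [F(5, 100), F(10, 100), F(15, 100), F(30, 100), F(40, 100)]
f1 = [F(40, 100), F(30, 100), F(15, 100), F(10, 100), F(5, 100)]


def prob(region, f):
    """Probability that the observation falls in the region under density f."""
    return sum((f[x] for x in region), F(0))


k_star = F(1, 2)
c_star = [x for x in points if f0[x] / f1[x] <= k_star]   # likelihood-ratio region
alpha = prob(c_star, f0)                                  # its size under the null
power_star = prob(c_star, f1)

# Largest power among ALL critical regions whose size does not exceed alpha.
best_power = max(
    prob(c, f1)
    for r in range(len(points) + 1)
    for c in combinations(points, r)
    if prob(c, f0) <= alpha
)
```

In this example the likelihood-ratio region is $\{0, 1\}$, with size $3/20$ and power $7/10$, and no competing region of size $3/20$ or less has larger power.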

We comment that $k^*$ and $C^*$ satisfying conditions (i) and (ii) do not always
exist and then the theorem, as stated, would not give a most powerful size-$\alpha$
test. However, whenever $f_0(\cdot)$ and $f_1(\cdot)$ are probability density functions, a $k^*$
and $C^*$ will exist. Although the theorem does not explicitly say how to find $k^*$
and $C^*$, implicitly it does since the form of the test, that is, the critical region, is
given by Eq. (5). In practice, even though $k^*$ and $C^*$ do exist, often it is not
necessary to find them. Instead the inequality $\lambda \le k^*$ for $(x_1, \ldots, x_n) \in C^*$ is
manipulated into an equivalent inequality that is easier to work with, and the
actual test is then expressed in terms of the new inequality. The following
examples should help clarify the above.

EXAMPLE 7 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$,
where $\theta = \theta_0$ or $\theta = \theta_1$. $\theta_0$ and $\theta_1$ are known fixed numbers, and for
concreteness we assume that $\theta_1 > \theta_0$. We want to test $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta = \theta_1$. Now $L_0 = \theta_0^n \exp(-\theta_0 \sum x_i)$, $L_1 = \theta_1^n \exp(-\theta_1 \sum x_i)$, and
according to the Neyman-Pearson lemma the most powerful test will have
the form: Reject $\mathscr{H}_0$ if $\lambda \le k^*$ or if $(\theta_0/\theta_1)^n \exp[-(\theta_0 - \theta_1) \sum x_i] \le k^*$,
which is equivalent to

$$\sum x_i \le \frac{1}{\theta_1 - \theta_0} \log_e\left[\left(\frac{\theta_1}{\theta_0}\right)^n k^*\right] = k' \text{ (say)},$$

where $k'$ is just a constant. The inequality $\lambda \le k^*$ has been simplified and
expressed as the equivalent inequality $\sum x_i \le k'$. Condition (i) is $\alpha =
P_{\theta_0}[\text{reject } \mathscr{H}_0] = P_{\theta_0}[\sum X_i \le k']$. We know that $\sum X_i$ has a gamma
distribution with parameters $n$ and $\theta$; hence

$$P_{\theta_0}\left[\sum X_i \le k'\right] = \int_0^{k'} \frac{1}{\Gamma(n)}\, \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx = \alpha,$$

an equation in $k'$, from which $k'$ can be determined; and the most powerful
test of size $\alpha$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$, $\theta_1 > \theta_0$, is this: Reject $\mathscr{H}_0$
if $\sum x_i \le k'$, where $k'$ is the $\alpha$th quantile point of the gamma distribution
with parameters $n$ and $\theta_0$. ////
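The equation for $k'$ in Example 7 has no closed-form solution for general $n$, but it is easy to solve numerically. The following is a minimal sketch, not a definitive implementation: it assumes $n$ is a positive integer (so that the gamma distribution function has a finite-sum form), and the function names `erlang_cdf` and `critical_value` are our own illustrative choices.

```python
import math

def erlang_cdf(x, n, rate):
    """P[X_1 + ... + X_n <= x] for iid exponential(rate) variables, integer n >= 1."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0            # running value of sum_{j<n} (rate*x)^j / j!
    for j in range(1, n):
        term *= rate * x / j
        total += term
    return 1.0 - math.exp(-rate * x) * total

def critical_value(alpha, n, theta0, tol=1e-12):
    """Solve erlang_cdf(k, n, theta0) = alpha for k by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n, theta0) < alpha:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, theta0) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative values (assumed, not from the text): n = 10, theta_0 = 1, alpha = .05.
k_prime = critical_value(alpha=0.05, n=10, theta0=1.0)
```

With these assumed inputs the test rejects $\mathscr{H}_0$ when $\sum x_i \le$ `k_prime`. For $n = 1$ the routine can be checked against the closed form $k' = -\log_e(1 - \alpha)/\theta_0$.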

EXAMPLE 8 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta^x (1 - \theta)^{1-x} I_{\{0, 1\}}(x)$, where $\theta = \theta_0$ or $\theta = \theta_1$. We want to test $\mathscr{H}_0: \theta = \theta_0$
versus $\mathscr{H}_1: \theta = \theta_1$, where, say, $\theta_0 < \theta_1$. $L_0 = \theta_0^{\sum x_i}(1 - \theta_0)^{n - \sum x_i}$, $L_1 =
\theta_1^{\sum x_i}(1 - \theta_1)^{n - \sum x_i}$, and so $\lambda \le k^*$ if and only if

$$\frac{\theta_0^{\sum x_i}(1 - \theta_0)^{n - \sum x_i}}{\theta_1^{\sum x_i}(1 - \theta_1)^{n - \sum x_i}} \le k^*,$$

if and only if

$$\left[\frac{\theta_0(1 - \theta_1)}{(1 - \theta_0)\theta_1}\right]^{\sum x_i} \le k^* \left(\frac{1 - \theta_1}{1 - \theta_0}\right)^n,$$

or if and only if $\sum x_i \ge k'$, where $k'$ is a constant. {Note that
$\log_e\{\theta_0(1 - \theta_1)/[(1 - \theta_0)\theta_1]\} < 0$.} So a most powerful test would be of the
form: Reject $\mathscr{H}_0$ if $\sum x_i$ is large. For definiteness, let us take $\theta_0 = \frac{1}{4}$,
$\theta_1 = \frac{3}{4}$, and $n = 10$. We must find $k'$ so that

$$\alpha = P_{\theta_0 = 1/4}[\text{reject } \mathscr{H}_0] = P_{\theta_0 = 1/4}\left[\sum X_i \ge k'\right] = \sum_{j = k'}^{10} \binom{10}{j} \left(\frac{1}{4}\right)^j \left(\frac{3}{4}\right)^{10 - j}.$$

If $\alpha = .0197$, then $k' = 6$, and if $\alpha = .0781$, then $k' = 5$. For $\alpha = .05$, there
is no critical region $C^*$ and constant $k^*$ of the form given in the Neyman-
Pearson lemma. In this example our random variables are discrete, and
for discrete random variables it is not possible to find a $k^*$ and $C^*$ satisfying
conditions (i) and (ii) for an arbitrary fixed $0 < \alpha < 1$. In practice, one is
usually content to change the size of the test to some $\alpha$ for which a test
given by the Neyman-Pearson lemma can be found. We might note,
however, that a most powerful test of size $\alpha$ does exist. The test would be
a randomized test. For the example at hand, if we take $\alpha = .05$, the
randomized test with critical function

$$\psi(x_1, \ldots, x_{10}) = \begin{cases} 1 & \text{if } \sum x_i \ge 6 \\[4pt] \dfrac{.05 - .0197}{.0584} & \text{if } \sum x_i = 5 \\[4pt] 0 & \text{if } \sum x_i \le 4 \end{cases}$$

is the most powerful test of size $\alpha = .05$.

////
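The numbers in Example 8 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the helper names `binom_pmf`, `upper_tail`, and `randomized_test` are our own, and the routine assumes $0 < \alpha < 1$. It finds the boundary value of $\sum x_i$ at which randomization is needed and the probability with which $\mathscr{H}_0$ is rejected there.

```python
from math import comb

def binom_pmf(j, n, p):
    """P[sum X_i = j] for n Bernoulli(p) trials."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def upper_tail(k, n, p):
    """P[sum X_i >= k] for n Bernoulli(p) trials."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

def randomized_test(alpha, n, p0):
    """Return (k, gamma): reject H0 if the sum exceeds k, and reject with
    probability gamma if the sum equals k. Assumes 0 < alpha < 1."""
    k = n
    while upper_tail(k, n, p0) <= alpha:   # largest k whose upper tail exceeds alpha
        k -= 1
    gamma = (alpha - upper_tail(k + 1, n, p0)) / binom_pmf(k, n, p0)
    return k, gamma

k, gamma = randomized_test(alpha=0.05, n=10, p0=0.25)
size = upper_tail(k + 1, 10, 0.25) + gamma * binom_pmf(k, 10, 0.25)
```

Here `k` is the value of $\sum x_i$ on which the test randomizes, `gamma` is the ratio $(.05 - .0197)/.0584$ appearing in the critical function, and `size` recovers the nominal $\alpha$ up to rounding.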

In closing this subsection, we note that a most powerful test of size $\alpha$,
given by the Neyman-Pearson lemma, is necessarily a simple likelihood-ratio
test.


2.3 Loss Function 

As in the last subsection, we assume that we have a random sample $X_1, \ldots, X_n$
from one or the other of the two completely known densities $f_0(\cdot) = f(\cdot; \theta_0)$ and
$f_1(\cdot) = f(\cdot; \theta_1)$. On the basis of the observed sample we have to decide from
which of the two densities the sample came; that is, we test $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta = \theta_1$. We can make one of two decisions, say $d_0$ or $d_1$, where $d_j$ is the
decision that $f_j(\cdot)$ is the density from which the sample came, $j = 0, 1$. We
assume that a loss function is available.

Definition 10 Loss function In testing $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$,
define $\ell(d_i; \theta_j)$ = loss incurred when decision $d_i$ is made and $\theta_j$ is the true
parameter value for $i = 0, 1$ and $j = 0, 1$, where $d_i$ is the decision of deciding
that hypothesis $\mathscr{H}_i$ is correct. We will adopt the convention that
$\ell(d_i; \theta_i) = 0$ for $i = 0, 1$ and $\ell(d_i; \theta_j) > 0$ for $i \ne j$. ////

The values of the loss function are the amounts that are lost if decision $d_i$
is made when $\theta_j$ was correct. With our convention, nothing is lost if the right
decision is made, and a positive amount is lost if a wrong decision is made. If
we think in terms of a (nonrandomized) test $\Upsilon$ having a critical region $C_\Upsilon$, then
decision $d_1$ is made if the observed sample $(x_1, \ldots, x_n)$ belongs to $C_\Upsilon$, and
decision $d_0$ is made if $(x_1, \ldots, x_n)$ belongs to $\bar{C}_\Upsilon$. A test can be thought of as a
decision function since for a given observed sample the test tells us which decision
to make. We do not consider the problem of selecting an appropriate loss
function, and hence we will always assume that an appropriate loss function has
been prescribed. Note that if decision $d_1$ is made when $\theta_0$ is correct, then a
Type I error is made, and if decision $d_0$ is made when $\theta_1$ is correct, then a Type II
error is made.

In comparing two tests we naturally prefer that test which has smaller loss, 
and among all tests we would prefer that test which has smallest loss. However, 
seldom will there exist one test that has smallest loss for both possible decisions 
and for both $\theta_0$ and $\theta_1$. This motivates the defining of average loss, and by
continuing to borrow language from decision theory we define the risk function 
of a test. 


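Definition 11 Risk function For a random sample $X_1, \ldots, X_n$ from
$f(\cdot; \theta_0)$ or $f(\cdot; \theta_1)$, let $\Upsilon$ be a test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ having
a critical region $C_\Upsilon$. For a given loss function $\ell(\cdot; \cdot)$, the risk function
of $\Upsilon$, denoted by $\mathscr{R}_\Upsilon(\theta)$, is defined to be the expected loss; that is,

$$\mathscr{R}_\Upsilon(\theta) = \int \cdots \int_{C_\Upsilon} \ell(d_1; \theta) \left[\prod_{i=1}^{n} f(x_i; \theta)\, dx_i\right] + \int \cdots \int_{\bar{C}_\Upsilon} \ell(d_0; \theta) \left[\prod_{i=1}^{n} f(x_i; \theta)\, dx_i\right].$$

////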


Remark

$$\mathscr{R}_\Upsilon(\theta) = \ell(d_1; \theta) P_\theta[(X_1, \ldots, X_n) \in C_\Upsilon] + \ell(d_0; \theta) P_\theta[(X_1, \ldots, X_n) \in \bar{C}_\Upsilon]$$

$$= \ell(d_1; \theta)\pi_\Upsilon(\theta) + \ell(d_0; \theta)[1 - \pi_\Upsilon(\theta)]; \qquad (6)$$

that is, the risk function is a linear function of the power function; the
coefficients in the linear function are determined by the values of the loss
function. Since $\theta$ assumes only two values, $\mathscr{R}_\Upsilon(\theta)$ can take on only two
values, which are

$$\mathscr{R}_\Upsilon(\theta_0) = \ell(d_1; \theta_0)\pi_\Upsilon(\theta_0) \quad \text{and} \quad \mathscr{R}_\Upsilon(\theta_1) = \ell(d_0; \theta_1)[1 - \pi_\Upsilon(\theta_1)]. \qquad (7)$$

////

Our object is to select that test which has smallest risk, but, unfortunately,
such a test will seldom exist. The difficulty is that the risk takes on two values, 
and a test that minimizes both of these values simultaneously over all possible 
tests does not exist except in rare situations. (See Prob. 3.) Not being able 
to find a test with smallest risk, we resort to another less desirable criterion, that 
of minimizing the largest value of the risk function. 

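Definition 12 Minimax test A test $\Upsilon_m$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$
is defined to be minimax if and only if

$$\max[\mathscr{R}_{\Upsilon_m}(\theta_0), \mathscr{R}_{\Upsilon_m}(\theta_1)] \le \max[\mathscr{R}_{\Upsilon}(\theta_0), \mathscr{R}_{\Upsilon}(\theta_1)]$$

for any other test $\Upsilon$. ////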

The following theorem is sometimes useful in finding a minimax test. (As 
with Theorem 2, we state the theorem and proof in terms of nonrandomized 
tests.) 


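Theorem 3 For a random sample $X_1, \ldots, X_n$ from $f(\cdot; \theta_0)$ or $f(\cdot; \theta_1)$,
consider testing $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. If a test $\Upsilon_m$ has a critical
region given by $C_m = \{(x_1, \ldots, x_n): \lambda \le k_m\}$, where $k_m$ is a positive constant
such that $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$, then $\Upsilon_m$ is minimax. Recall that

$$\lambda = L_0/L_1 = \left[\prod_{i=1}^{n} f(x_i; \theta_0)\right] \Big/ \left[\prod_{i=1}^{n} f(x_i; \theta_1)\right].$$

PROOF We will assume that $f(\cdot; \theta_0)$ and $f(\cdot; \theta_1)$ are probability
density functions. The proof for discrete density functions is similar.

Let $\Upsilon$ be any other test with a critical region $C_\Upsilon$ which satisfies
$\mathscr{R}_\Upsilon(\theta_0) \le \mathscr{R}_{\Upsilon_m}(\theta_0)$. [Note that if $\mathscr{R}_\Upsilon(\theta_0) > \mathscr{R}_{\Upsilon_m}(\theta_0)$, then $\Upsilon$ would not
even be a candidate for minimax.] We have $\mathscr{R}_\Upsilon(\theta_0) \le \mathscr{R}_{\Upsilon_m}(\theta_0)$,
$\ell(d_1; \theta_0)\pi_\Upsilon(\theta_0) \le \ell(d_1; \theta_0)\pi_{\Upsilon_m}(\theta_0)$, or $\pi_\Upsilon(\theta_0) \le \pi_{\Upsilon_m}(\theta_0)$; that is, $\Upsilon$ has size
less than or equal to that of $\Upsilon_m$. But, by the Neyman-Pearson lemma,
$\Upsilon_m$ is the most powerful test of size $\pi_{\Upsilon_m}(\theta_0)$; hence $\pi_\Upsilon(\theta_1) \le \pi_{\Upsilon_m}(\theta_1)$,
$1 - \pi_\Upsilon(\theta_1) \ge 1 - \pi_{\Upsilon_m}(\theta_1)$, $\ell(d_0; \theta_1)[1 - \pi_\Upsilon(\theta_1)] \ge \ell(d_0; \theta_1)[1 - \pi_{\Upsilon_m}(\theta_1)]$, or
$\mathscr{R}_\Upsilon(\theta_1) \ge \mathscr{R}_{\Upsilon_m}(\theta_1)$. So, we have $\max[\mathscr{R}_{\Upsilon_m}(\theta_0), \mathscr{R}_{\Upsilon_m}(\theta_1)] = \mathscr{R}_{\Upsilon_m}(\theta_1) \le
\mathscr{R}_\Upsilon(\theta_1) \le \max[\mathscr{R}_\Upsilon(\theta_0), \mathscr{R}_\Upsilon(\theta_1)]$; that is, $\Upsilon_m$ is minimax. ////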

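EXAMPLE 9 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$.
For $\theta_1 > \theta_0$, test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. In Example 7, we found
the most powerful size-$\alpha$ test. We seek now to find the minimax test for
a loss function given by $\ell(d_1; \theta_0) = a$ and $\ell(d_0; \theta_1) = b$. According to
Theorem 3, the minimax test $\Upsilon_m$ is given by $C_m = \{(x_1, \ldots, x_n): \lambda \le k_m\}$
where $k_m$ is such that $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$. $\{\lambda \le k_m\}$ can be rewritten as
$\{\sum x_i \le k\}$ for some constant $k$, and $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$ if and only if
$a\pi_{\Upsilon_m}(\theta_0) = b[1 - \pi_{\Upsilon_m}(\theta_1)]$; so we seek $k$ such that $aP_{\theta_0}[\sum X_i \le k] =
bP_{\theta_1}[\sum X_i > k]$; $k$ is given as the solution to

$$a \int_0^{k} \frac{1}{\Gamma(n)}\, \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx = b \int_k^{\infty} \frac{1}{\Gamma(n)}\, \theta_1^n x^{n-1} e^{-\theta_1 x}\, dx.$$

////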
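The equation defining the minimax cutoff $k$ in Example 9 can be solved by bisection, since its left side increases from 0 to $a$ and its right side decreases from $b$ to 0 as $k$ grows. The sketch below is only an illustration under stated assumptions: $n$ is a positive integer, and the names `erlang_cdf` and `minimax_cutoff` are hypothetical helpers of our own.

```python
import math

def erlang_cdf(x, n, rate):
    """P[X_1 + ... + X_n <= x] for iid exponential(rate) variables, integer n >= 1."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= rate * x / j
        total += term
    return 1.0 - math.exp(-rate * x) * total

def minimax_cutoff(a, b, n, theta0, theta1, tol=1e-12):
    """Solve a * P_theta0[sum <= k] = b * P_theta1[sum > k] for k."""
    def gap(k):
        # risk at theta0 minus risk at theta1; increasing in k
        return a * erlang_cdf(k, n, theta0) - b * (1.0 - erlang_cdf(k, n, theta1))
    lo, hi = 0.0, 1.0
    while gap(hi) < 0:                 # expand until the two risks cross
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed illustrative inputs: a = b = 1, n = 1, theta_0 = 1, theta_1 = 2.
k = minimax_cutoff(a=1.0, b=1.0, n=1, theta0=1.0, theta1=2.0)
```

For these assumed inputs the equation reduces to $1 - e^{-k} = e^{-2k}$, whose solution is $k = -\log_e[(\sqrt{5} - 1)/2]$, which gives a simple check on the routine.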

Before leaving minimax tests we make two comments: First, if $f_0(\cdot)$ and
$f_1(\cdot)$ are discrete density functions, then there may not exist a $k_m$ such that
$\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$ unless randomized tests are allowed; and, second, a minimax
test as given in Theorem 3 was a simple likelihood-ratio test.

In the above we assumed that $X_1, \ldots, X_n$ was a random sample from
$f(\cdot; \theta)$, where $\theta = \theta_0$ or $\theta_1$, and for each $\theta$, $f(\cdot; \theta)$ is completely known. We
also assumed that we had an appropriate loss function. Now, we further
assume that $\theta_0$ and $\theta_1$ are the possible values of a random variable $\Theta$ and that
we know the distribution of $\Theta$, which is called the prior distribution, just as in
our considerations of Bayes estimation. $\Theta$ is discrete, taking on only two
values $\theta_0$ and $\theta_1$; so the prior distribution of $\Theta$ is completely given by, say, $g$,
where $g = P[\Theta = \theta_1] = 1 - P[\Theta = \theta_0]$. We mentioned above that, in general,
a test with smallest risk function for both arguments does not exist. Now that 
we have a prior distribution for the two arguments of the risk function, we can 
define an average risk and seek that test with smallest average risk. 


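Definition 13 Bayes test A test $\Upsilon_g$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ is
defined to be a Bayes test with respect to the prior distribution given by
$g = P[\Theta = \theta_1]$ if and only if

$$(1 - g)\mathscr{R}_{\Upsilon_g}(\theta_0) + g\mathscr{R}_{\Upsilon_g}(\theta_1) \le (1 - g)\mathscr{R}_{\Upsilon}(\theta_0) + g\mathscr{R}_{\Upsilon}(\theta_1) \qquad (8)$$

for any other test $\Upsilon$. ////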

To find a Bayes test we seek a critical region $C_g$ that minimizes

$$(1 - g)\mathscr{R}_\Upsilon(\theta_0) + g\mathscr{R}_\Upsilon(\theta_1) = (1 - g)\ell(d_1; \theta_0)\pi_\Upsilon(\theta_0) + g\ell(d_0; \theta_1)[1 - \pi_\Upsilon(\theta_1)]$$

$$= (1 - g)\ell(d_1; \theta_0) \int_C L_0 + g\ell(d_0; \theta_1) \int_{\bar{C}} L_1$$

as a function of the region $C$. Now

$$(1 - g)\ell(d_1; \theta_0) \int_C L_0 + g\ell(d_0; \theta_1) \int_{\bar{C}} L_1 = g\ell(d_0; \theta_1) + \int_C [(1 - g)\ell(d_1; \theta_0)L_0 - g\ell(d_0; \theta_1)L_1], \qquad (9)$$

which is minimized if $C$ is defined to be all $(x_1, \ldots, x_n)$ for which the last
integrand of Eq. (9) is negative; that is

$$C_g = \{(x_1, \ldots, x_n): (1 - g)\ell(d_1; \theta_0)L_0 - g\ell(d_0; \theta_1)L_1 < 0\}. \qquad (10)$$

We have proved the following theorem. 


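Theorem 4 The Bayes test $\Upsilon_g$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ with
respect to a prior distribution given by $g = P[\Theta = \theta_1]$ has a critical region
defined by

$$C_g = \left\{(x_1, \ldots, x_n): \lambda < \frac{g\ell(d_0; \theta_1)}{(1 - g)\ell(d_1; \theta_0)}\right\}. \qquad (11)$$

////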


We note that once again a good test, in this case a Bayes test, turns out to 
be a simple likelihood-ratio test. The exact form of the Bayes test is given by 
Eq. (11).


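EXAMPLE 10 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0, \infty)}(x)$. Test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. The critical region
of a Bayes test is given by

$$C_g = \left\{(x_1, \ldots, x_n): \frac{\theta_0^n \exp(-\theta_0 \sum x_i)}{\theta_1^n \exp(-\theta_1 \sum x_i)} < \frac{g\ell(d_0; \theta_1)}{(1 - g)\ell(d_1; \theta_0)}\right\}$$

$$= \left\{(x_1, \ldots, x_n): \sum x_i < \frac{1}{\theta_1 - \theta_0} \log_e\left[\frac{g\theta_1^n \ell(d_0; \theta_1)}{(1 - g)\theta_0^n \ell(d_1; \theta_0)}\right]\right\}$$

for $\theta_1 > \theta_0$. ////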
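The two forms of the Bayes critical region in Example 10 can be compared numerically. The following sketch is illustrative only; the function names and the sample values are assumptions of ours, and the code presumes $\theta_1 > \theta_0$ and $0 < g < 1$.

```python
import math

def bayes_cutoff(g, loss_d0_theta1, loss_d1_theta0, n, theta0, theta1):
    """Cutoff c such that the Bayes test rejects H0 when sum(x) < c (theta1 > theta0)."""
    ratio = g * loss_d0_theta1 / ((1.0 - g) * loss_d1_theta0)
    return (math.log(ratio) + n * math.log(theta1 / theta0)) / (theta1 - theta0)

def likelihood_ratio(xs, theta0, theta1):
    """Simple likelihood-ratio L0 / L1 for an exponential sample."""
    n, s = len(xs), sum(xs)
    return (theta0 / theta1) ** n * math.exp(-(theta0 - theta1) * s)

# Assumed illustrative inputs: equal prior weights, unit losses, theta_0 = 1, theta_1 = 2.
g, l01, l10, theta0, theta1 = 0.5, 1.0, 1.0, 1.0, 2.0
sample_small = [0.5, 0.5, 0.5]     # sum = 1.5
sample_large = [1.0, 1.0, 1.0]     # sum = 3.0
cutoff = bayes_cutoff(g, l01, l10, 3, theta0, theta1)
threshold = g * l01 / ((1.0 - g) * l10)   # right side of the inequality in Eq. (11)
```

With these inputs `cutoff` equals $3 \log_e 2$; the first sample falls in the critical region under both forms of the inequality and the second falls outside it under both.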


3 COMPOSITE HYPOTHESES 

In Sec. 2 above we considered testing a simple hypothesis against a simple 
alternative. We return now to the more general hypotheses-testing problem, 
that of testing composite hypotheses. We assume that we have a random sample 
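from $f(x; \theta)$, $\theta \in \overline{\Theta}$, and we want to test $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta}_1$, where
$\overline{\Theta}_0 \subset \overline{\Theta}$, $\overline{\Theta}_1 \subset \overline{\Theta}$, and $\overline{\Theta}_0$ and $\overline{\Theta}_1$ are disjoint. Usually $\overline{\Theta}_1 = \overline{\Theta} - \overline{\Theta}_0$. We
begin by discussing a general method of constructing a test.

3.1 Generalized Likelihood-ratio Test

For a random sample $X_1, \ldots, X_n$ from a density $f(x; \theta)$, $\theta \in \overline{\Theta}$, we seek a test of
$\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta}_1 = \overline{\Theta} - \overline{\Theta}_0$.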


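Definition 14 Generalized likelihood-ratio Let $L(\theta; x_1, \ldots, x_n)$
be the likelihood function for a sample $X_1, \ldots, X_n$ having joint density
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$ where $\theta \in \overline{\Theta}$. The generalized likelihood-ratio,
denoted by $\lambda$ or $\lambda_n$, is defined to be

$$\lambda = \lambda_n = \lambda(x_1, \ldots, x_n) = \frac{\sup_{\theta \in \overline{\Theta}_0} L(\theta; x_1, \ldots, x_n)}{\sup_{\theta \in \overline{\Theta}} L(\theta; x_1, \ldots, x_n)}. \qquad (12)$$

////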


Note that $\lambda$ is a function of $x_1, \ldots, x_n$, namely $\lambda(x_1, \ldots, x_n)$. When the
observations are replaced by their corresponding random variables $X_1, \ldots, X_n$,
then we write $\Lambda$ for $\lambda$; that is, $\Lambda = \lambda(X_1, \ldots, X_n)$. $\Lambda$ is a function of the random
variables $X_1, \ldots, X_n$ and is itself a random variable. In fact, $\Lambda$ is a statistic
since it does not depend on unknown parameters.

Several further notes follow: (i) Although we used the same symbol $\lambda$ to
denote the simple likelihood-ratio, the generalized likelihood-ratio does not
reduce to the simple likelihood-ratio for $\overline{\Theta} = \{\theta_0, \theta_1\}$. (ii) $\lambda$ given by Eq. (12)
necessarily satisfies $0 \le \lambda \le 1$; $\lambda \ge 0$ since we have a ratio of nonnegative
quantities, and $\lambda \le 1$ since the supremum taken in the denominator is over a
larger set of parameter values than that in the numerator; hence the denominator
cannot be smaller than the numerator. (iii) The parameter $\theta$ can be vector-valued.
(iv) The denominator of $\Lambda$ is the likelihood function evaluated at the
maximum-likelihood estimator. (v) In our considerations of the generalized
likelihood-ratio, often the sample $X_1, \ldots, X_n$ will be a random sample from a
density $f(x; \theta)$ where $\theta \in \overline{\Theta}$.

The values $\lambda$ of the statistic $\Lambda$ are used to formulate a test of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$
versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$ by employing the generalized likelihood-ratio test
principle, which states that $\mathscr{H}_0$ is to be rejected if and only if $\lambda \le \lambda_0$, where $\lambda_0$ is
some fixed constant satisfying $0 \le \lambda_0 \le 1$. (The constant $\lambda_0$ is often specified
by fixing the size of the test.) $\Lambda$ is the test statistic. The generalized likelihood-
ratio test makes good intuitive sense since $\lambda$ will tend to be small when $\mathscr{H}_0$ is
not true, since then the denominator of $\lambda$ tends to be larger than the numerator.
In general, a generalized likelihood-ratio test will be a good test, although there
are examples where the generalized likelihood-ratio test makes a poor showing
compared to other tests. One possible drawback of the test is that it is sometimes
difficult to find $\sup L(\theta; x_1, \ldots, x_n)$; another is that it can be difficult to
find the distribution of $\Lambda$ which is required to evaluate the power of the test.


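EXAMPLE 11 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta > 0\}$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

$$\sup_{\theta \in \overline{\Theta}} L(\theta; x_1, \ldots, x_n) = \sup_{\theta > 0}\left[\theta^n \exp\left(-\theta \sum x_i\right)\right] = \left(\frac{n}{\sum x_i}\right)^n e^{-n}$$

and

$$\sup_{\theta \in \overline{\Theta}_0} L(\theta; x_1, \ldots, x_n) = \sup_{0 < \theta \le \theta_0}\left[\theta^n \exp\left(-\theta \sum x_i\right)\right] = \begin{cases} \left(\dfrac{n}{\sum x_i}\right)^n e^{-n} & \text{if } \dfrac{n}{\sum x_i} \le \theta_0 \\[8pt] \theta_0^n \exp\left(-\theta_0 \sum x_i\right) & \text{if } \dfrac{n}{\sum x_i} > \theta_0. \end{cases}$$

Hence

$$\lambda = \begin{cases} 1 & \text{if } \dfrac{n}{\sum x_i} \le \theta_0 \\[8pt] \dfrac{\theta_0^n \exp(-\theta_0 \sum x_i)}{(n/\sum x_i)^n e^{-n}} & \text{if } \dfrac{n}{\sum x_i} > \theta_0. \end{cases} \qquad (13)$$

If $0 < \lambda_0 < 1$, then a generalized likelihood-ratio test is given by the
following: Reject $\mathscr{H}_0$ if $\lambda \le \lambda_0$, or

$$\text{reject } \mathscr{H}_0 \text{ if } \frac{n}{\sum x_i} > \theta_0 \text{ and } \left(\frac{\theta_0 \sum x_i}{n}\right)^n \exp\left(-\theta_0 \sum x_i + n\right) \le \lambda_0, \qquad (14)$$

or reject $\mathscr{H}_0$ if $\theta_0 \bar{x} < 1$ and $(\theta_0 \bar{x})^n \exp[-n(\theta_0 \bar{x} - 1)] \le \lambda_0$. Write $y = \theta_0 \bar{x}$, and
note that $y^n e^{-n(y - 1)}$ has a maximum for $y = 1$. Hence $y < 1$, and
$y^n e^{-n(y - 1)} \le \lambda_0$ if and only if $y \le k$, where $k$ is a constant satisfying
$0 < k < 1$. See Fig. 4.

We see that a generalized likelihood-ratio test reduces to the following:

$$\text{Reject } \mathscr{H}_0 \text{ if and only if } \theta_0 \bar{x} \le k, \text{ where } 0 < k < 1; \qquad (15)$$

that is, reject $\mathscr{H}_0$ if $\bar{x}$ is less than some fraction of $1/\theta_0$. If that
generalized likelihood-ratio test having size $\alpha$ is desired, $k$ is obtained as the
solution to the equation

$$\alpha = P_{\theta_0}[\theta_0 \bar{X} \le k] = P_{\theta_0}\left[\theta_0 \sum X_i \le nk\right] = \int_0^{nk} \frac{1}{\Gamma(n)}\, u^{n-1} e^{-u}\, du.$$

(Note that $P_\theta[\theta_0 \bar{X} \le k] \le P_{\theta_0}[\theta_0 \bar{X} \le k]$ for $\theta \le \theta_0$.) ////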
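As a check on Eq. (13), the generalized likelihood-ratio of Example 11 can be evaluated directly and compared with the simplified form $y^n e^{-n(y - 1)}$, $y = \theta_0 \bar{x}$. The sketch below is illustrative; the function name `glr_lambda` and the sample values are our own assumptions.

```python
import math

def glr_lambda(xs, theta0):
    """Generalized likelihood-ratio of Eq. (13) for H0: theta <= theta0
    with an exponential(theta) sample xs (all entries positive)."""
    n, s = len(xs), sum(xs)
    if n / s <= theta0:            # unrestricted MLE already lies in the null set
        return 1.0
    y = theta0 * s / n             # y = theta0 * xbar, which is < 1 on this branch
    return y**n * math.exp(-n * (y - 1.0))

lam_null = glr_lambda([2.0, 2.0], theta0=1.0)   # MLE = 0.5 <= theta0
lam_alt = glr_lambda([0.5, 0.5], theta0=1.0)    # MLE = 2.0 >  theta0
# Direct evaluation of the second branch of Eq. (13) for the same sample:
direct = (1.0**2 * math.exp(-1.0 * 1.0)) / ((2 / 1.0) ** 2 * math.exp(-2))
```

On the second branch the value decreases as $\theta_0 \bar{x}$ decreases below 1, which is why the rule $\lambda \le \lambda_0$ is equivalent to $\theta_0 \bar{x} \le k$ in Eq. (15).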

We note that in the above example the first form of the test, as given in 
Eq. (14), is rather messy, yet after some manipulation the test reduces to a very 
simple form as given in Eq. (15). Such a pattern often appears in dealing with 
generalized likelihood-ratio tests—their first form is often foreboding, yet the 
tests often simplify into some nice form. We will observe this again in Sec. 4 
below when we consider tests concerning sampling from the normal distribution. 

We might note, by considering the factorization criterion, that a generalized 
likelihood-ratio test must necessarily depend only on minimal sufficient statistics. 

In Sec. 5 below, a large-sample distribution of the generalized likelihood- 
ratio is given. This will provide us with a method of obtaining tests with 
approximate size a. 

3.2 Uniformly Most Powerful Tests 

In Subsec. 3.1 above we exhibited a method of obtaining a family of tests of
$\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$. We now define one optimum property
that such a test may possess. It is defined in terms of the power function $\pi_\Upsilon(\theta)$
and the size of the test.

Definition 15 Uniformly most powerful test A test $\Upsilon^*$ of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$
versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$ is defined to be a uniformly most powerful size-$\alpha$
test if and only if:

(i) $\sup_{\theta \in \overline{\Theta}_0} \pi_{\Upsilon^*}(\theta) = \alpha$.

(ii) $\pi_{\Upsilon^*}(\theta) \ge \pi_\Upsilon(\theta)$ for all $\theta \in \overline{\Theta} - \overline{\Theta}_0$ and for any test $\Upsilon$ with size
less than or equal to $\alpha$. ////

A test $\Upsilon^*$ is uniformly most powerful of size $\alpha$ if it has size $\alpha$ and if among
all tests of size less than or equal to $\alpha$ it has the largest power for all alternative
values of $\theta$. The adverb "uniformly" refers to "all" alternative $\theta$ values. A
uniformly most powerful test does not exist for all testing problems, but when
one does exist, we can see that it is quite a nice test since among all tests of size $\alpha$
or less it has the greatest chance of rejecting $\mathscr{H}_0$ whenever it should.


EXAMPLE 12 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta \ge \theta_0\}$. Find a uniformly most powerful
test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. For fixed $\theta_1 > \theta_0$, we determined
in Example 7 that the most powerful test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$
was given by the following: Reject $\mathscr{H}_0$ if $\sum x_i \le k'$, where $k'$ was given as
a solution to the equation

$$\alpha = \int_0^{k'} \frac{1}{\Gamma(n)}\, \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx.$$

Such a test was given by the Neyman-Pearson lemma. Note that the test
in no way depends on $\theta_1$ except that $\theta_1 > \theta_0$; hence, we would get the same
most powerful test for any $\theta_1 > \theta_0$, and thus the test is actually uniformly
most powerful! ////


The above example provides us with an example of a situation where a 
uniformly most powerful test can be obtained using the Neyman-Pearson lemma. 
That same technique can be used to find uniformly most powerful tests in more 
general situations, such as those given in Theorems 5 and 6 below, which are 
given without proof. 

Theorem 5 Let $X_1, \ldots, X_n$ be a random sample from the density $f(x; \theta)$,
$\theta \in \overline{\Theta}$, where $\overline{\Theta}$ is some interval. Assume that $f(x; \theta) =
a(\theta)b(x) \exp[c(\theta)d(x)]$, and set $t(x_1, \ldots, x_n) = \sum_{i=1}^{n} d(x_i)$.

(i) If $c(\theta)$ is a monotone, increasing function in $\theta$ and if there exists
$k^*$ such that $P_{\theta_0}[t(X_1, \ldots, X_n) > k^*] = \alpha$, then the test $\Upsilon^*$ with a critical
region $C^* = \{(x_1, \ldots, x_n): t(x_1, \ldots, x_n) > k^*\}$ is a uniformly most powerful
size-$\alpha$ test of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$ or of $\mathscr{H}_0: \theta = \theta_0$
versus $\mathscr{H}_1: \theta > \theta_0$.

(ii) If $c(\theta)$ is a monotone, decreasing function in $\theta$ and if there
exists $k^*$ such that $P_{\theta_0}[t(X_1, \ldots, X_n) < k^*] = \alpha$, then the test $\Upsilon^*$ with a
critical region $C^* = \{(x_1, \ldots, x_n): t(x_1, \ldots, x_n) < k^*\}$ is a uniformly most
powerful size-$\alpha$ test of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$ or of $\mathscr{H}_0: \theta = \theta_0$
versus $\mathscr{H}_1: \theta > \theta_0$. ////

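EXAMPLE 13 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta > 0\}$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.
$f(x; \theta) = \theta I_{(0, \infty)}(x) \exp(-\theta x) = a(\theta)b(x) \exp[c(\theta)d(x)]$; so $t(x_1, \ldots, x_n)
= \sum_{i=1}^{n} x_i$, and $c(\theta) = -\theta$. $c(\theta)$ is a monotone, decreasing function; so
by (ii) of Theorem 5 a uniformly most powerful test is given by the following:
Reject $\mathscr{H}_0$ if and only if $\sum x_i < k^*$, where $k^*$ is given as a solution to

$$\alpha = P_{\theta_0}\left[\sum X_i < k^*\right] = \int_0^{k^*} \frac{1}{\Gamma(n)}\, \theta_0^n u^{n-1} e^{-\theta_0 u}\, du.$$

////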

Definition 16 Monotone likelihood-ratio A family of densities $\{f(x; \theta):
\theta \in \overline{\Theta}, \overline{\Theta}$ an interval$\}$ is said to have a monotone likelihood-ratio
if there exists a statistic, say $T = t(X_1, \ldots, X_n)$, such that the ratio
$L(\theta'; x_1, \ldots, x_n)/L(\theta''; x_1, \ldots, x_n)$ is either a nonincreasing function of
$t(x_1, \ldots, x_n)$ for every $\theta' < \theta''$ or a nondecreasing function of $t(x_1, \ldots, x_n)$
for every $\theta' < \theta''$. ////

Note that in the term “ monotone likelihood-ratio ” the likelihood-ratio is not 
a generalized likelihood-ratio; it is a ratio of two likelihood functions. 


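EXAMPLE 14 If $\{f(x; \theta): \theta \in \overline{\Theta}\} = \{\theta e^{-\theta x} I_{(0, \infty)}(x): \theta > 0\}$, then

$$\frac{L(\theta'; x_1, \ldots, x_n)}{L(\theta''; x_1, \ldots, x_n)} = \frac{(\theta')^n \exp(-\theta' \sum x_i)}{(\theta'')^n \exp(-\theta'' \sum x_i)} = \left(\frac{\theta'}{\theta''}\right)^n \exp\left[-(\theta' - \theta'') \sum x_i\right],$$

which is a monotone, increasing function in $\sum x_i$ (for $\theta' < \theta''$).

////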


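EXAMPLE 15 If $\{f(x; \theta): \theta \in \overline{\Theta}\} = \{(1/\theta) I_{(0, \theta)}(x): \theta > 0\}$, then

$$\frac{L(\theta'; x_1, \ldots, x_n)}{L(\theta''; x_1, \ldots, x_n)} = \frac{(1/\theta')^n \prod_{i=1}^{n} I_{(0, \theta')}(x_i)}{(1/\theta'')^n \prod_{i=1}^{n} I_{(0, \theta'')}(x_i)} = \frac{(1/\theta')^n I_{(0, \theta')}(y_n)}{(1/\theta'')^n I_{(0, \theta'')}(y_n)} = \begin{cases} (\theta''/\theta')^n & \text{for } 0 < y_n < \theta' \\ 0 & \text{for } \theta' \le y_n < \theta'', \end{cases}$$

which is a monotone, nonincreasing function in $y_n = \max[x_1, \ldots, x_n]$.
[Note that $y_n$ cannot fall outside of the interval $(0, \theta'')$ when $\theta$ is either $\theta'$
or $\theta''$.] ////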

Theorem 6 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta)$, $\theta \in \overline{\Theta}$, where $\overline{\Theta}$ is
some interval. Assume that the family of densities $\{f(x; \theta): \theta \in \overline{\Theta}\}$ has a
monotone likelihood-ratio in the statistic $t(X_1, \ldots, X_n)$:

(i) If the monotone likelihood-ratio is nondecreasing in $t(x_1, \ldots,
x_n)$ and if $k^*$ is such that $P_{\theta_0}[t(X_1, \ldots, X_n) < k^*] = \alpha$, then
the test corresponding to the critical region $C^* = \{(x_1, \ldots, x_n):
t(x_1, \ldots, x_n) < k^*\}$ is a uniformly most powerful test of size $\alpha$ of
$\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

(ii) If the monotone likelihood-ratio is nonincreasing in $t(x_1, \ldots,
x_n)$ and if $k^*$ is such that $P_{\theta_0}[t(X_1, \ldots, X_n) > k^*] = \alpha$, then the
test corresponding to the critical region $C^* = \{(x_1, \ldots, x_n): t(x_1, \ldots,
x_n) > k^*\}$ is a uniformly most powerful test of size $\alpha$ of $\mathscr{H}_0: \theta \le \theta_0$
versus $\mathscr{H}_1: \theta > \theta_0$. ////


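EXAMPLE 16 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
(1/\theta) I_{(0, \theta)}(x)$, where $\theta > 0$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. We
saw in Example 15 that the family of densities has a monotone, nonincreasing
likelihood-ratio in $t(x_1, \ldots, x_n) = y_n = \max[x_1, \ldots, x_n]$. According
to (ii) of Theorem 6, a uniformly most powerful size-$\alpha$ test is given by the
following: Reject $\mathscr{H}_0$ if $y_n > k^*$, where $k^*$ is given as the solution to

$$\alpha = P_{\theta_0}[Y_n > k^*] = \int_{k^*}^{\theta_0} n \left(\frac{1}{\theta_0}\right)^n y^{n-1}\, dy = 1 - \left(\frac{k^*}{\theta_0}\right)^n,$$

which implies that $k^* = \theta_0 \sqrt[n]{1 - \alpha}$.

////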
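The cutoff of Example 16 and the resulting power function are simple enough to evaluate directly. The sketch below is illustrative; the names `ump_cutoff` and `power` and the numerical inputs are our own assumptions. It uses the fact that for a sample from the uniform density on $(0, \theta)$, $P_\theta[Y_n \le y] = (y/\theta)^n$ for $0 < y < \theta$.

```python
def ump_cutoff(alpha, n, theta0):
    """k* = theta0 * (1 - alpha)^(1/n) from Example 16."""
    return theta0 * (1.0 - alpha) ** (1.0 / n)

def power(theta, k_star, n):
    """P_theta[max X_i > k_star] for a sample of size n from uniform(0, theta)."""
    if theta <= k_star:
        return 0.0                 # the maximum cannot exceed theta
    return 1.0 - (k_star / theta) ** n

# Assumed illustrative inputs: alpha = .05, n = 5, theta_0 = 2.
k_star = ump_cutoff(alpha=0.05, n=5, theta0=2.0)
size = power(2.0, k_star, 5)       # equals alpha at theta = theta_0
```

The power is 0 for $\theta \le k^*$, equals $\alpha$ at $\theta_0$, and increases toward 1 as $\theta$ grows, so the supremum of the power over $\theta \le \theta_0$ is attained at $\theta_0$.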


Several comments are in order. First, the null hypothesis was stated as
$\theta \le \theta_0$ in both Theorems 5 and 6; if it had been stated as $\theta \ge \theta_0$, the two theorems
would remain valid provided the inequalities that define the critical regions were
reversed. Second, Theorem 5 is a consequence of Theorem 6. Third, the 
theorems consider only one-sided hypotheses. 

This completes our brief study of uniformly most powerful tests. We have 
seen that a uniformly most powerful test exists for one-sided hypotheses if the 
density sampled from has a monotone likelihood-ratio in some statistic. There 
are many hypothesis-testing problems for which no uniformly most powerful
test exists. One method of restricting the class of tests, with the hope and
intention of finding an optimum test within the restricted class, is to consider 
unbiasedness of tests, to be defined in the next subsection. 

3.3 Unbiased Tests 

There are many hypotheses-testing problems for which a uniformly most powerful 
test does not exist. In these cases it may be possible to restrict the class of tests 
and find a uniformly most powerful test in the restricted class. One such class 
that has some merit is the class of unbiased tests. 

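Definition 17 Unbiased tests A test $\Upsilon$ of the null hypothesis $\mathscr{H}_0: \theta \in \overline{\Theta}_0$
against the alternative hypothesis $\mathscr{H}_1: \theta \in \overline{\Theta}_1$ is an unbiased test if and only
if

$$\sup_{\theta \in \overline{\Theta}_0} \pi_\Upsilon(\theta) \le \inf_{\theta \in \overline{\Theta}_1} \pi_\Upsilon(\theta).$$

////

Consequently in an unbiased test the probability of rejecting $\mathscr{H}_0$ when it is
false is at least as large as the probability of rejecting $\mathscr{H}_0$ when it is true. In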
many respects this seems to be a reasonable restriction to place on a test. If 
within this restricted class a test exists that is uniformly most powerful, then we 
have a uniformly most powerful unbiased test. An elaborate theory has been 
developed for finding uniformly most powerful unbiased tests, but we will not 
study it. See [16]. 

3.4 Methods of Finding Tests 

We presented in Subsec. 3.1 the generalized likelihood-ratio principle; it provides 
us with one method of obtaining tests of hypotheses. In Subsec. 3.2 we gave a 
method of finding a uniformly most powerful test for certain testing problems. 
There are still other methods of finding tests. One method is sketched on pages 
456 to 459 of Subsec. 5.4 of this chapter. Another method, which might be 
called the confidence-interval method, is to use a confidence interval to obtain a 
test. For instance, if it is desired to test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$, then
we might compute a confidence-interval estimate of $\theta$ from the data, and if the
interval contains $\theta_0$, accept $\mathscr{H}_0$, and otherwise, reject it. If the confidence
interval had a confidence coefficient $\gamma$, then the resulting test would have size
$1 - \gamma$. We will say more about how confidence intervals might be used to
obtain tests in Sec. 6 below. 

A useful and intuitive technique for obtaining tests is the following: 
Discover some statistic which behaves differently under the two hypotheses, and
utilize the different behavior to design a test. As an illustration, consider testing
$\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$, where the sample $X_1, \ldots, X_n$ is selected from
the density $f(x; \theta) = \phi_{\theta, 1}(x)$. The statistic $\bar{X}$ has a normal distribution with
mean $\theta$ and variance $1/n$; hence the statistic $\bar{X}$ will tend to be smaller when $\mathscr{H}_0$
is true than when $\mathscr{H}_0$ is false. The statistic $\bar{X}$ behaves differently under the two
hypotheses. A reasonable test, then, would be to reject $\mathscr{H}_0$ for $\bar{X}$ large; that is,
reject $\mathscr{H}_0$ if $\bar{x} > k$, where $k$ is determined by, say, fixing the size of the test.
(We know from Subsec. 3.2 above that such a test is uniformly most powerful.)
To employ this technique, a statistic has to be discovered which will behave 
differently under the two hypotheses. There are various ways of approaching 
the task of discovering such a statistic. For instance, if a sufficient statistic 
exists, then it is a natural candidate to try; or a good estimator, such as a maxi¬ 
mum-likelihood estimator, of the parameter or parameters that are used to 
specify the hypotheses is another possibility for the needed statistic. In the 
above simple illustration $\bar{X}$ was all these since $\bar{X}$ is the maximum-likelihood
estimator of $\theta$ as well as being a sufficient statistic. We make frequent use of
this intuitive technique for obtaining tests in the remaining sections. 


EXAMPLE 17 Let $X_1, \ldots, X_n$ be a random sample from a Poisson distribution
with mean $\theta$. Suppose that it is desired to test that the mean is a fixed
value, say $\theta_0$; that is, test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$. We know that
$\bar{X}$ is the maximum-likelihood estimator of $\theta$ and that $\bar{X}$ will tend to be
distributed about $\theta_0$ if $\mathscr{H}_0$ is true. Consequently, the following test seems
reasonable: Accept $\mathscr{H}_0$ if $c_1 < \bar{x} < c_2$, and otherwise reject it, where $c_1$ and
$c_2$ are selected so that the test will have a desired size. To be specific, let
$n = 10$ and $\theta_0 = 1$. The test given by "Accept $\mathscr{H}_0$ if and only if $.4 < \bar{x} <
1.6$" has size given by

$$1 - P_{\theta = 1}[.4 < \bar{X} < 1.6] = 1 - P_{\theta = 1}\left[4 < \sum X_i < 16\right] = 1 - \sum_{j = 5}^{15} \frac{e^{-10}\, 10^j}{j!} \approx .078.$$

////


A test that has been quite extensively applied in various fields of science is
$\mathscr{H}_0: \theta = \theta_0$ against $\mathscr{H}_1: \theta \ne \theta_0$. For example, let $\theta$ be the mean difference of
yields between two varieties of wheat. It is often suggested that it is desirable
to test the hypothesis $\mathscr{H}_0: \theta = 0$ against $\mathscr{H}_1: \theta \ne 0$, that is, to test if the two
varieties are different in their mean yields. However, in this situation, and
many others where $\theta$ can vary continuously in some interval, it is inconceivable
that $\theta$ is exactly equal to 0 (that the varieties are identical in their mean yields).
Yet this is what the test is stating: Are the two mean yields identical (to one
ten-billionth of a bushel, etc.)? In many cases it seems more realistic for an
experimenter to select an interval about $\theta_0$, say $\theta_1 < \theta_0 < \theta_2$, and test $\mathscr{H}_0: \theta_1 \le
\theta \le \theta_2$ against the alternative $\mathscr{H}_1: \theta < \theta_1$ or $\theta > \theta_2$. For example, it may be
feasible to set = —j and 0 2 = \ in the above illustration and test if the 
difference of the mean yields of the two varieties is between bushel and +j 
bushel against the alternative that it is not in this interval. A test that is 
uniformly most powerful for the above hypothesis may be difficult or impossible 
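to devise, but if $f(x; \theta)$ is a density with a single parameter, then the maximum-
likelihood estimator $\hat{\Theta}$ may sometimes be used to construct a test and the power
of this test compared with the ideal power function for a test of size $\alpha$. A test
of the following form may be used for some densities: Reject $\mathscr{H}_0$ if $\hat{\theta}$ is not in
some interval, say $(c_1, c_2)$, and accept $\mathscr{H}_0$ if $\hat{\theta}$ is in the interval, where $c_1$ and $c_2$
are chosen so that the test has size $\alpha$. Often $c_1$ and $c_2$ can be chosen so that

$$\int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta_1)\, d\hat{\theta} = \int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta_2)\, d\hat{\theta} = 1 - \alpha,$$

where $f_{\hat{\Theta}}(\hat{\theta}; \theta)$ is the density of $\hat{\Theta}$ when $\theta$ is the parameter. The power function
of this test is

$$\pi(\theta) = 1 - \int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta)\, d\hat{\theta} \quad \text{for } \theta \text{ in } \overline{\Theta}.$$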

This power function can be compared with the ideal power function, and if it 
does not deviate further from the ideal than the experimenter can tolerate, the 
test may be useful even though it may not be a uniformly most powerful test. 
Let us illustrate the above with a simple example. 

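EXAMPLE 18 Let $X_1, \ldots, X_n$ be a random sample from $\phi_{\theta, 1}(x)$. Test
$\mathscr{H}_0: 1 \le \theta \le 2$ versus $\mathscr{H}_1: \theta < 1$ or $\theta > 2$. $\bar{X}$ is the maximum-likelihood
estimator of $\theta$; it has a normal distribution with mean $\theta$ and variance $1/n$.
According to the above we would like to select $c_1$ and $c_2$ so that

$$1 - \alpha = \int_{c_1}^{c_2} \phi_{1, 1/n}(x)\, dx = \int_{c_1}^{c_2} \phi_{2, 1/n}(x)\, dx.$$

We have

$$\int_{c_1}^{c_2} \phi_{\theta, 1/n}(x)\, dx = \Phi\left(\sqrt{n}(c_2 - \theta)\right) - \Phi\left(\sqrt{n}(c_1 - \theta)\right),$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function,
and we can see from Fig. 5 that $c_1 = \frac{3}{2} - d$ and $c_2 = \frac{3}{2} + d$, where $d$ is given
by, say,

$$1 - \alpha = \Phi\left(\sqrt{n}\left(\tfrac{1}{2} + d\right)\right) - \Phi\left(\sqrt{n}\left(\tfrac{1}{2} - d\right)\right).$$

For example, if $\alpha = .05$ and $n = 16$, then $d \approx .911$; so $c_1 \approx .589$, and
$c_2 \approx 2.411$. The power function is given by

$$\pi(\theta) = 1 - P_\theta[c_1 < \bar{X} < c_2] = 1 - P_\theta[.589 < \bar{X} < 2.411]$$

and is sketched in Fig. 6. ////

FIGURE 6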
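The constants of Example 18 can be recovered numerically. The sketch below assumes, as the example's symmetry suggests, that $c_1 = \frac{3}{2} - d$ and $c_2 = \frac{3}{2} + d$, so that $d$ solves $\Phi(\sqrt{n}(\frac{1}{2} + d)) - \Phi(\sqrt{n}(\frac{1}{2} - d)) = 1 - \alpha$. The function names are illustrative choices of ours, not part of the text.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def half_width(alpha, n, tol=1e-12):
    """Solve Phi(sqrt(n)(1/2 + d)) - Phi(sqrt(n)(1/2 - d)) = 1 - alpha for d >= 0."""
    root_n = math.sqrt(n)
    def coverage(d):
        return std_normal_cdf(root_n * (0.5 + d)) - std_normal_cdf(root_n * (0.5 - d))
    lo, hi = 0.0, 1.0
    while coverage(hi) < 1.0 - alpha:      # coverage(d) increases from 0 toward 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power(theta, c1, c2, n):
    """pi(theta) = 1 - P_theta[c1 < Xbar < c2], Xbar ~ N(theta, 1/n)."""
    root_n = math.sqrt(n)
    return 1.0 - (std_normal_cdf(root_n * (c2 - theta)) - std_normal_cdf(root_n * (c1 - theta)))

d = half_width(alpha=0.05, n=16)
c1, c2 = 1.5 - d, 1.5 + d
```

Evaluating `power` on a grid of $\theta$ values gives the curve described in the example; it equals $\alpha$ at $\theta = 1$ and $\theta = 2$ and is smaller in between.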


4 TESTS OF HYPOTHESES—SAMPLING 
FROM THE NORMAL DISTRIBUTION 

A number of the foregoing ideas are well illustrated by common practical testing 
problems—those problems of testing hypotheses concerning the parameters of 
normal distributions. The section is subdivided into four subsections, the first 
two dealing with just one normal population and the last two dealing with several 
normal populations. 

4.1 Tests on the Mean 

We shall assume that we have a random sample of $n$ observations $X_1, \ldots, X_n$
from a normal population with mean $\mu$ and variance $\sigma^2$, and we will be interested
in testing hypotheses about $\mu$. There is quite a variety of hypotheses about the
mean $\mu$ that can be formulated; we begin by considering one-sided hypotheses.






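$\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$   In testing $\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$ there
are two cases to consider depending on whether or not $\sigma^2$ is assumed known.
If $\sigma^2$ is assumed known, our parameter space is the real line, and we are testing a
one-sided hypothesis; so we have hope of finding a uniformly most powerful
test. Since $\sigma^2$ is assumed known, it is a known constant, hence

$$f(x; \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac12[(x-\mu)/\sigma]^2} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac12(\mu/\sigma)^2}\, e^{-\frac12(x/\sigma)^2}\, e^{(\mu/\sigma^2)x},$$

which is a member of the exponential family with

$$a(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac12(\mu/\sigma)^2}, \quad b(x) = e^{-\frac12(x/\sigma)^2}, \quad c(\mu) = \frac{\mu}{\sigma^2}, \quad \text{and} \quad d(x) = x.$$

The conditions for Theorem 5 are satisfied; so the uniformly most powerful size-$\alpha$
test is given by the following: Reject $\mathscr{H}_0$ if $t(x_1, \ldots, x_n) = \sum x_i > k^*$, where $k^*$
is given as a solution to $P_{\mu_0}[\sum X_i > k^*] = \alpha$. Now $\alpha = P_{\mu_0}[\sum X_i > k^*] =
1 - \Phi\big((k^* - n\mu_0)/(\sqrt n\,\sigma)\big)$; so $(k^* - n\mu_0)/(\sqrt n\,\sigma) = z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1 - \alpha)$th
quantile of the standard normal distribution. The test becomes the following:
Reject $\mathscr{H}_0$ if $\sum x_i > n\mu_0 + \sqrt n\,\sigma z_{1-\alpha}$, or reject if $\bar x > \mu_0 + (\sigma/\sqrt n) z_{1-\alpha}$.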
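
As an illustration of the known-variance case, the cutoff $\mu_0 + (\sigma/\sqrt n) z_{1-\alpha}$ and the resulting power function can be computed directly. This is a minimal sketch; the function names and the numerical values of $\mu_0$, $\sigma$, $n$, and $\alpha$ are our own illustrative choices.

```python
from statistics import NormalDist

def z_test_cutoff(mu0, sigma, n, alpha):
    """Cutoff for xbar: reject H0: mu <= mu0 when xbar exceeds mu0 + (sigma/sqrt(n)) z_{1-alpha}."""
    z = NormalDist().inv_cdf(1 - alpha)  # z_{1-alpha}
    return mu0 + sigma / n ** 0.5 * z

def z_test_power(mu, mu0, sigma, n, alpha):
    """P_mu[Xbar > cutoff], where Xbar is N(mu, sigma^2/n)."""
    cutoff = z_test_cutoff(mu0, sigma, n, alpha)
    return 1 - NormalDist(mu, sigma / n ** 0.5).cdf(cutoff)

# Illustrative values only.
cutoff = z_test_cutoff(mu0=10.0, sigma=2.0, n=25, alpha=0.05)
size = z_test_power(10.0, 10.0, 2.0, 25, 0.05)
print(cutoff, size)
```

The power is increasing in $\mu$, so the probability of rejection never exceeds $\alpha$ over the whole null hypothesis $\mu \le \mu_0$ and equals $\alpha$ at $\mu = \mu_0$.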

If $\sigma^2$ is assumed unknown, then testing $\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$ is
equivalent to testing $\mathscr{H}_0: \theta \in \Theta_0$ versus $\mathscr{H}_1: \theta \notin \Theta_0$, where $\theta = (\mu, \sigma^2)$, $\Theta =
\{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 > 0\}$, and $\Theta_0 = \{(\mu, \sigma^2): \mu \le \mu_0;\ \sigma^2 > 0\}$. To obtain
a test, we could use the generalized likelihood-ratio principle, or we could find
some statistic that behaves differently under the two hypotheses and base our
test on it. Such a statistic is $T = (\bar X - \mu_0)/(S/\sqrt n)$, where $\bar X$ is the sample mean
and $S^2$ is the sample variance. Since $T$ would tend to be larger for $\mu > \mu_0$ than
for $\mu \le \mu_0$, a test based on $T$ is given by the following: Reject $\mathscr{H}_0$ if $T$ is large;
that is, reject $\mathscr{H}_0$ if $T > k$. If $\mu = \mu_0$, then $T$ has a $t$ distribution with $n - 1$
degrees of freedom; so $k$ can be determined by setting $\alpha = P_{\mu=\mu_0}[T > k]$, which
implies that $k = t_{1-\alpha}(n - 1)$, the $(1 - \alpha)$th quantile of a $t$ distribution with $n - 1$
degrees of freedom. It can be shown that the test derived here is a generalized
likelihood-ratio test having size $\alpha$.
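
A small sketch of the one-sided $t$ test may help fix ideas. The data below are made up for illustration, the function name is ours, and the critical value $t_{.95}(9) = 1.833$ is read from a standard $t$ table rather than computed.

```python
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """T = (xbar - mu0) / (s / sqrt(n)), with s the sample standard deviation (divisor n - 1)."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / n ** 0.5)

# Hypothetical observations, invented for this illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.4, 5.3]
t_crit = 1.833  # t_{.95}(9), taken from a t table (an input here, not computed)

t_obs = t_statistic(sample, mu0=5.0)
reject = t_obs > t_crit  # reject H0: mu <= 5.0 at size .05 when T is large
print(t_obs, reject)
```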


$\mathscr{H}_0: \mu = \mu_0$ versus $\mathscr{H}_1: \mu \ne \mu_0$   Again, we have two cases to consider
depending on whether or not $\sigma^2$ is assumed known. For $\sigma^2$ known, we know that
$(\bar X - z_{(1+\gamma)/2}(\sigma/\sqrt n),\ \bar X + z_{(1+\gamma)/2}(\sigma/\sqrt n))$ is a $100\gamma$ percent confidence interval
for $\mu$, where $z_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile of the standard normal
distribution. A possible test is given by the following: Reject $\mathscr{H}_0$ if the confidence
interval does not contain $\mu_0$. Such a test has size $1 - \gamma$ since

$$P_{\mu=\mu_0}\left[\bar X - z_{(1+\gamma)/2}\frac{\sigma}{\sqrt n} \le \mu_0 \le \bar X + z_{(1+\gamma)/2}\frac{\sigma}{\sqrt n}\right] = \gamma.$$

If $\sigma^2$ is assumed unknown, we could obtain a test, similar to the one above,
using the $100\gamma$ percent confidence interval

$$\left(\bar X - t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt n},\ \bar X + t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt n}\right).$$

Instead, let us find a generalized likelihood-ratio test.

$$L(\mu, \sigma^2; x_1, \ldots, x_n) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac12 \sum \left(\frac{x_i - \mu}{\sigma}\right)^2\right],$$

$\Theta_0 = \{(\mu, \sigma^2): \mu = \mu_0;\ \sigma^2 > 0\}$, and $\Theta = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 > 0\}$. We
have already seen that the values of $\mu$ and $\sigma^2$ which maximize $L(\mu, \sigma^2; x_1, \ldots, x_n)$
in $\Theta$ are $\hat\mu = \bar x$ and $\hat\sigma^2 = (1/n) \sum (x_i - \bar x)^2$; so

$$\sup_{\Theta} L(\mu, \sigma^2; x_1, \ldots, x_n) = \left[\frac{n}{2\pi \sum (x_i - \bar x)^2}\right]^{n/2} e^{-n/2}.$$

To maximize $L$ over $\Theta_0$, we put $\mu = \mu_0$, and the only remaining parameter is
$\sigma^2$; the value of $\sigma^2$ which then maximizes $L$ is readily found to be $\hat\sigma^2 =
(1/n) \sum (x_i - \mu_0)^2$, which gives

$$\sup_{\Theta_0} L(\mu, \sigma^2; x_1, \ldots, x_n) = \left[\frac{n}{2\pi \sum (x_i - \mu_0)^2}\right]^{n/2} e^{-n/2}.$$

The generalized likelihood-ratio is then

$$\lambda = \left[\frac{\sum (x_i - \bar x)^2}{\sum (x_i - \mu_0)^2}\right]^{n/2} = \left[\frac{\sum (x_i - \bar x)^2}{\sum (x_i - \bar x + \bar x - \mu_0)^2}\right]^{n/2} = \left[\frac{\sum (x_i - \bar x)^2}{\sum (x_i - \bar x)^2 + n(\bar x - \mu_0)^2}\right]^{n/2} = \left[\frac{1}{1 + n(\bar x - \mu_0)^2 / \sum (x_i - \bar x)^2}\right]^{n/2}. \tag{16}$$

We note now that $\lambda$ is a monotonic function of

$$t^2 = t^2(x_1, \ldots, x_n) = \frac{n(n - 1)(\bar x - \mu_0)^2}{\sum (x_i - \bar x)^2} = \left[\frac{\bar x - \mu_0}{\sqrt{\sum (x_i - \bar x)^2 / [n(n - 1)]}}\right]^2,$$

and so a critical region of the form $\lambda \le \lambda_0$ is equivalent to a critical region of the
form $t^2(x_1, \ldots, x_n) \ge k^2$. A generalized likelihood-ratio test is then given by
the following: Reject $\mathscr{H}_0$ if and only if

$$|T| = \left|\frac{\bar X - \mu_0}{S/\sqrt n}\right| \ge k,$$

or accept $\mathscr{H}_0$ if and only if $-k < T < k$. Since $T$ has a $t$ distribution with $n - 1$
degrees of freedom when $\mu = \mu_0$, if $k$ is selected so that $\int_{-k}^{k} f_T(t; n - 1)\, dt =
1 - \alpha$, then our test will have size $\alpha$. $k$ is given by $t_{1-\alpha/2}(n - 1)$, the $(1 - \alpha/2)$th
quantile of a $t$ distribution with $n - 1$ degrees of freedom. We might note that
this size-$\alpha$ test obtained by using the generalized likelihood-ratio principle is the
same size-$\alpha$ test that we obtained above using the confidence-interval method
of obtaining tests with the confidence interval

$$\left(\bar X - t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt n},\ \bar X + t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt n}\right),$$

where $\gamma = 1 - \alpha$. Although we will not prove it, the test that we have obtained
is uniformly most powerful unbiased.
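
The monotone relation between the generalized likelihood-ratio and $t^2$ can be confirmed numerically: computing $\lambda$ from the two sums of squares and from $[1 + t^2/(n - 1)]^{-n/2}$ should give the same value. The sketch below uses made-up data and function names of our own choosing.

```python
from statistics import mean

def glr_lambda(sample, mu0):
    """lambda = [sum (x_i - xbar)^2 / sum (x_i - mu0)^2]^(n/2)."""
    n = len(sample)
    xbar = mean(sample)
    ss_full = sum((x - xbar) ** 2 for x in sample)   # minimized over all of Theta
    ss_null = sum((x - mu0) ** 2 for x in sample)    # minimized over Theta_0
    return (ss_full / ss_null) ** (n / 2)

def t_squared(sample, mu0):
    """t^2 = n (n - 1) (xbar - mu0)^2 / sum (x_i - xbar)^2."""
    n = len(sample)
    xbar = mean(sample)
    return n * (n - 1) * (xbar - mu0) ** 2 / sum((x - xbar) ** 2 for x in sample)

# Hypothetical observations, invented for this illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.4, 5.3]
n = len(sample)
lam = glr_lambda(sample, 5.0)
t2 = t_squared(sample, 5.0)
lam_from_t = (1 + t2 / (n - 1)) ** (-n / 2)
print(lam, lam_from_t, t2)
```

Small values of $\lambda$ correspond to large values of $t^2$, which is why the test can be carried out entirely in terms of $T$.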

We have found tests on the mean of a normal distribution for both one-sided
and two-sided hypotheses. One might note that the one-sided null
hypothesis $\mu \le \mu_0$ could be reversed and comparable results obtained. There are
other hypotheses about the mean that could be formulated, such as $\mathscr{H}_0: \mu_1 \le
\mu \le \mu_2$ versus $\mathscr{H}_1: \mu < \mu_1$ or $\mu > \mu_2$.

4.2 Tests on the Variance 

As in the last subsection, we shall assume that we have a random sample of size $n$
from a normal population with mean $\mu$ and variance $\sigma^2$. We will be interested
in testing hypotheses about $\sigma^2$.

$\mathscr{H}_0: \sigma^2 \le \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 > \sigma_0^2$   There are two cases to consider depending on
whether or not $\mu$ is assumed known. If $\mu$ is known, then our parameter space
is an interval, and our hypothesis is one-sided; so we have a chance of finding a
uniformly most powerful size-$\alpha$ test.

$$f(x; \theta) = f(x; \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)(x - \mu)^2},$$

which is a member of the exponential family with $a(\sigma^2) = (2\pi\sigma^2)^{-1/2}$, $b(x) = 1$,
$c(\sigma^2) = -1/(2\sigma^2)$, and $d(x) = (x - \mu)^2$. [$\mu$ is known; so $d(x)$ is a function of $x$
only.] $c(\sigma^2)$ is a monotone, increasing function in $\sigma^2$; so, by Theorem 5, the
test with critical region $\{(x_1, \ldots, x_n): \sum (x_i - \mu)^2 > k^*\}$ is uniformly most
powerful of size $\alpha$, where $k^*$ is given by $P_{\sigma^2=\sigma_0^2}[\sum (X_i - \mu)^2 > k^*] = \alpha$, which
implies that $k^* = \sigma_0^2\, \chi^2_{1-\alpha}(n)$, where $\chi^2_{1-\alpha}(n)$ is the $(1 - \alpha)$th quantile point of the
chi-square distribution with $n$ degrees of freedom.

If $\mu$ is unknown, a test can be found using the statistic $V = \sum (X_i - \bar X)^2/\sigma_0^2$.
$V$ will tend to be larger for $\sigma^2 > \sigma_0^2$ than for $\sigma^2 \le \sigma_0^2$; so a reasonable test would
be to reject $\mathscr{H}_0$ for $V$ large. If $\sigma^2 = \sigma_0^2$, then $V$ has a chi-square distribution
with $n - 1$ degrees of freedom, and $P_{\sigma^2=\sigma_0^2}[V > \chi^2_{1-\alpha}(n - 1)] = \alpha$, where
$\chi^2_{1-\alpha}(n - 1)$ is the $(1 - \alpha)$th quantile of a chi-square distribution with $n - 1$
degrees of freedom. It can be shown that the test given by the following:
Reject $\mathscr{H}_0$ if and only if $\sum (X_i - \bar X)^2/\sigma_0^2 > \chi^2_{1-\alpha}(n - 1)$ is a generalized
likelihood-ratio test of size $\alpha$.
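
The test based on $V$ can be sketched in a few lines. Everything in the block is our own illustrative construction: the chi-square distribution function is evaluated by the standard power series for the regularized incomplete gamma function, the quantile is found by bisection, and the data and $\sigma_0^2$ are invented.

```python
import math
from statistics import mean

def chi2_cdf(x, df):
    """Chi-square cdf via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert the cdf by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def variance_test(sample, sigma0_sq, alpha):
    """Reject H0: sigma^2 <= sigma0^2 (mu unknown) when V > chi2_{1-alpha}(n - 1)."""
    n = len(sample)
    xbar = mean(sample)
    v = sum((x - xbar) ** 2 for x in sample) / sigma0_sq
    crit = chi2_quantile(1 - alpha, n - 1)
    return v, crit, v > crit

# Hypothetical observations and null value, invented for this illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.4, 5.3]
v, crit, reject = variance_test(sample, sigma0_sq=0.04, alpha=0.05)
print(v, crit, reject)
```

If $\mu$ were known, the same code would apply with $\bar x$ replaced by $\mu$ and $n - 1$ replaced by $n$.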

$\mathscr{H}_0: \sigma^2 = \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 \ne \sigma_0^2$   We leave the case $\mu$ assumed known as an
exercise. For $\mu$ unknown, so that $\Theta_0 = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 = \sigma_0^2\}$, we
can find a size-$\alpha$ test using the confidence-interval method. In Subsec. 3.2 of
Chap. VIII, we found the following $100\gamma$ percent confidence interval for $\sigma^2$:

$$\left(\frac{(n - 1)S^2}{q_2},\ \frac{(n - 1)S^2}{q_1}\right),$$

where $q_1$ and $q_2$ are quantile points of a chi-square distribution with $n - 1$ degrees
of freedom, say $f_Q(q; n - 1)$, satisfying

$$\int_{q_1}^{q_2} f_Q(q; n - 1)\, dq = \gamma.$$

A size-($\alpha = 1 - \gamma$) test is given by the following: Accept $\mathscr{H}_0$ if and only if $\sigma_0^2$ is
contained in the above confidence interval. It is left as an exercise to show that
for a particular pair of $q_1$ and $q_2$ the test of size $\alpha$ derived by the
confidence-interval method is in fact the generalized likelihood-ratio test of size $\alpha$.


4.3 Tests on Several Means 

In this subsection we will consider testing hypotheses regarding the means of two 
or more normal populations. We begin with a test of the equality of two means. 

Equality of two means In many situations it is necessary to compare two 
means when neither is known. If, for example, one wished to compare two 
proposed new processes for manufacturing light bulbs, one would have to base 
the comparison on estimates of both process means. In comparing the yield of
a new line of hybrid corn with that of a standard line, one would also have to use
estimates of both mean yields because it is impossible to state the mean yield of 
the standard line for the given weather conditions under which the new line 
would be grown. It is necessary to compare the two lines by planting them in 
the same season and on the same soil type and thereby obtain estimates of the 
mean yields for both lines under similar conditions. Of course the comparison 
is thus specialized; a complete comparison of the two lines would require tests 
over a period of years on a variety of soil types. 

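The general problem is this: We have two normal populations, one with
a random variable $X_1$, which has a mean $\mu_1$ and variance $\sigma_1^2$, and the other with
a random variable $X_2$, which has a mean $\mu_2$ and variance $\sigma_2^2$. On the basis of
two samples, one from each population, we wish to test the null hypothesis

$$\mathscr{H}_0: \mu_1 = \mu_2,\ \sigma_1^2 > 0,\ \sigma_2^2 > 0 \quad \text{versus} \quad \mathscr{H}_1: \mu_1 \ne \mu_2,\ \sigma_1^2 > 0,\ \sigma_2^2 > 0.$$

The parameter space $\Theta$ here is four-dimensional; a joint distribution of $X_1$ and
$X_2$ is specified when values are assigned to the four quantities $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$.
The subspace $\Theta_0$ is three-dimensional because values for only three quantities
$(\mu, \sigma_1^2, \sigma_2^2)$ need be specified in order to specify completely the joint distribution
under the hypothesis that $\mu_1 = \mu_2 = \mu$, say.

We shall suppose that there are $n_1$ observations $(X_{11}, X_{12}, \ldots, X_{1n_1})$ in
the sample from the first population and $n_2$ observations $(X_{21}, X_{22}, \ldots, X_{2n_2})$
from the second. The likelihood function is

$$L(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; x_{11}, \ldots, x_{1n_1}, x_{21}, \ldots, x_{2n_2}) = L = \left(\frac{1}{2\pi\sigma_1^2}\right)^{n_1/2} \exp\left[-\frac12 \sum_{i=1}^{n_1} \left(\frac{x_{1i} - \mu_1}{\sigma_1}\right)^2\right] \left(\frac{1}{2\pi\sigma_2^2}\right)^{n_2/2} \exp\left[-\frac12 \sum_{j=1}^{n_2} \left(\frac{x_{2j} - \mu_2}{\sigma_2}\right)^2\right],$$

and its maximum in $\Theta$ is readily seen to be

$$\sup_{\Theta} L = \left[\frac{n_1}{2\pi \sum_{i=1}^{n_1} (x_{1i} - \bar x_1)^2}\right]^{n_1/2} e^{-n_1/2} \left[\frac{n_2}{2\pi \sum_{j=1}^{n_2} (x_{2j} - \bar x_2)^2}\right]^{n_2/2} e^{-n_2/2}.$$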

If we put $\mu_1$ and $\mu_2$ equal to $\mu$, say, and try to maximize $L$ with respect to $\mu$, $\sigma_1^2$,
and $\sigma_2^2$, it will be found that the estimate of $\mu$ is given as the root of a cubic
equation and will be a very complex function of the observations. The resulting
generalized likelihood-ratio $\lambda$ will therefore be a complicated function, and to
find its distribution is a tedious task indeed and involves the ratio of the two
variances. This makes it impossible to determine a critical region $0 < \lambda < k$
for a given probability of a Type I error because the ratio of the population
variances is assumed unknown. A number of special devices can be employed
in an attempt to circumvent this difficulty, but we shall not pursue the problem
further here. For large samples the following criterion may be used: The root
of the cubic equation can be computed in any instance by numerical methods, and
$\lambda$ can then be calculated; furthermore, as we shall see in Sec. 5 below, the
quantity $-2 \log \Lambda$ has approximately the chi-square distribution with one
degree of freedom, and hence a test that would reject for $-2 \log \lambda$ large could
be devised.

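When it can be assumed that the two populations have the same variance,
the problem becomes relatively simple. The parameter space $\Theta$ is then
three-dimensional with coordinates $(\mu_1, \mu_2, \sigma^2)$, while $\Theta_0$ for the null hypothesis
$\mu_1 = \mu_2 = \mu$ is two-dimensional with coordinates $(\mu, \sigma^2)$. In $\Theta$ we find that
the maximum-likelihood estimates of $\mu_1$, $\mu_2$, and $\sigma^2$ are, respectively, $\bar x_1$, $\bar x_2$,
and

$$\frac{1}{n_1 + n_2} \left[\sum_{i=1}^{n_1} (x_{1i} - \bar x_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar x_2)^2\right],$$

so

$$\sup_{\Theta} L = \left[\frac{n_1 + n_2}{2\pi \left[\sum (x_{1i} - \bar x_1)^2 + \sum (x_{2j} - \bar x_2)^2\right]}\right]^{(n_1+n_2)/2} e^{-(n_1+n_2)/2}.$$

In $\Theta_0$, the maximum-likelihood estimates of $\mu$ and $\sigma^2$ are

$$\hat\mu = \frac{1}{n_1 + n_2} \left(\sum_{i=1}^{n_1} x_{1i} + \sum_{j=1}^{n_2} x_{2j}\right) = \frac{n_1 \bar x_1 + n_2 \bar x_2}{n_1 + n_2} \quad \text{for } \mu$$

and

$$\frac{1}{n_1 + n_2} \left[\sum (x_{1i} - \hat\mu)^2 + \sum (x_{2j} - \hat\mu)^2\right] = \frac{1}{n_1 + n_2} \left[\sum (x_{1i} - \bar x_1)^2 + \sum (x_{2j} - \bar x_2)^2 + \frac{n_1 n_2}{n_1 + n_2} (\bar x_1 - \bar x_2)^2\right] \quad \text{for } \sigma^2,$$

which gives

$$\sup_{\Theta_0} L = \left[\frac{n_1 + n_2}{2\pi \left[\sum (x_{1i} - \bar x_1)^2 + \sum (x_{2j} - \bar x_2)^2 + \dfrac{n_1 n_2}{n_1 + n_2} (\bar x_1 - \bar x_2)^2\right]}\right]^{(n_1+n_2)/2} e^{-(n_1+n_2)/2}.$$

Finally,

$$\lambda = \left(1 + \frac{[n_1 n_2/(n_1 + n_2)](\bar x_1 - \bar x_2)^2}{\sum (x_{1i} - \bar x_1)^2 + \sum (x_{2j} - \bar x_2)^2}\right)^{-(n_1+n_2)/2}. \tag{17}$$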


This last expression is very similar to the corresponding one obtained in
Subsec. 4.1, and it turns out that this test can also be performed in terms of a
quantity which has the $t$ distribution. We know that $\bar X_1$ and $\bar X_2$ are
independently normally distributed with means $\mu_1$ and $\mu_2$ and with variances $\sigma^2/n_1$ and
$\sigma^2/n_2$. Also it is readily seen that $\bar X_1 - \bar X_2$ is normally distributed with mean
$\mu_1 - \mu_2$ and variance $\sigma^2(1/n_1 + 1/n_2)$. Under the null hypothesis the mean of
$\bar X_1 - \bar X_2$ will be 0. The quantities $\sum (X_{1i} - \bar X_1)^2/\sigma^2$ and $\sum (X_{2j} - \bar X_2)^2/\sigma^2$ are
independently distributed as chi-square distributions with $n_1 - 1$ and $n_2 - 1$
degrees of freedom, respectively; hence their sum has the chi-square distribution
with $n_1 + n_2 - 2$ degrees of freedom. Since under the null hypothesis

$$\frac{\bar X_1 - \bar X_2}{\sigma\sqrt{1/n_1 + 1/n_2}}$$

is normally distributed with mean 0 and unit variance, the quantity

$$T = \frac{\sqrt{n_1 n_2/(n_1 + n_2)}\,(\bar X_1 - \bar X_2)}{\sqrt{\left[\sum (X_{1i} - \bar X_1)^2 + \sum (X_{2j} - \bar X_2)^2\right]/(n_1 + n_2 - 2)}} \tag{18}$$

has the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. [Note that we do
have independence of the numerator and denominator in Eq. (18).] The
generalized likelihood-ratio is

$$\lambda = \left[\frac{1}{1 + [t^2/(n_1 + n_2 - 2)]}\right]^{(n_1+n_2)/2}, \tag{19}$$

and its distribution is determined by the $t$ distribution. The test would, of
course, be done in terms of $T$ rather than $\lambda$. A 5 percent critical region for $T$ is
$T^2 > [t_{.975}(n_1 + n_2 - 2)]^2$, where $t_{.975}(n_1 + n_2 - 2)$ is the .975th quantile of the
$t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.

If we want to test $\mathscr{H}_0: \mu_1 = \mu_2$ versus $\mathscr{H}_1: \mu_1 > \mu_2$ or $\mathscr{H}_0: \mu_1 \le \mu_2$ versus
$\mathscr{H}_1: \mu_1 > \mu_2$, a size-$\alpha$ test is given by the following: Reject $\mathscr{H}_0$ if and only if
$T > t_{1-\alpha}(n_1 + n_2 - 2)$, where $T$ is defined in Eq. (18) and $t_{1-\alpha}(n_1 + n_2 - 2)$ is the
$(1 - \alpha)$th quantile of the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.
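
To make the two-sample statistic concrete, the sketch below computes $T$ of Eq. (18) and the likelihood ratio of Eq. (19) for two small made-up samples. The function name is ours, and the critical value $t_{.975}(7) = 2.365$ is read from a $t$ table rather than computed.

```python
from statistics import mean

def pooled_t(sample1, sample2):
    """Return (T, lambda): T as in Eq. (18) and lambda = [1 + t^2/(n1+n2-2)]^(-(n1+n2)/2)."""
    n1, n2 = len(sample1), len(sample2)
    xbar1, xbar2 = mean(sample1), mean(sample2)
    ss = (sum((x - xbar1) ** 2 for x in sample1)
          + sum((x - xbar2) ** 2 for x in sample2))   # pooled within-sample sum of squares
    df = n1 + n2 - 2
    t = (n1 * n2 / (n1 + n2)) ** 0.5 * (xbar1 - xbar2) / (ss / df) ** 0.5
    lam = (1 + t * t / df) ** (-(n1 + n2) / 2)
    return t, lam

# Hypothetical samples, invented for this illustration.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [2.0, 4.0, 6.0, 8.0]
t_obs, lam = pooled_t(sample1, sample2)
t_crit = 2.365  # t_{.975}(7), taken from a t table (an input here, not computed)
reject_two_sided = t_obs ** 2 > t_crit ** 2
print(t_obs, lam, reject_two_sided)
```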

Equality of several means   The test presented above can be extended from
just two normal populations to $k$ normal populations. We assume that we have
available $k$ random samples, one from each of $k$ normal populations; that is,
let $X_{j1}, \ldots, X_{jn_j}$ be a random sample of size $n_j$ from the $j$th normal population,
$j = 1, \ldots, k$. Assume that the $j$th population has mean $\mu_j$ and variance $\sigma^2$.
Further assume that the $k$ random samples are independent. Our object is to
test the null hypothesis that all the population means are the same versus the
alternative that not all the means are equal. We seek a generalized
likelihood-ratio test. The likelihood function is given by

$$L(\mu_1, \ldots, \mu_k, \sigma^2; x_{11}, \ldots, x_{1n_1}, \ldots, x_{k1}, \ldots, x_{kn_k}) = \prod_{j=1}^{k} \prod_{i=1}^{n_j} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac12[(x_{ji} - \mu_j)/\sigma]^2} = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \mu_j)^2\right],$$

where $n = \sum_{j=1}^{k} n_j$.

The parameter space $\Theta$ is $(k + 1)$-dimensional with coordinates
$(\mu_1, \ldots, \mu_k, \sigma^2)$, and $\Theta_0$, the collection of points in the parameter space
corresponding to the null hypothesis, is two-dimensional with coordinates $(\mu, \sigma^2)$, where
$\mu = \mu_1 = \cdots = \mu_k$. In $\Theta$, the maximum-likelihood estimates of $\mu_1, \ldots, \mu_k, \sigma^2$
are given by

$$\hat\mu_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ji} = \bar x_{j\cdot}, \quad j = 1, \ldots, k,$$

and

$$\hat\sigma^2_{\Theta} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \bar x_{j\cdot})^2;$$

hence,

$$\sup_{\Theta} L = \left[\frac{2\pi \sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2}{n}\right]^{-n/2} e^{-n/2}.$$

In $\Theta_0$, the maximum-likelihood estimates of $\mu$ and $\sigma^2$ are

$$\hat\mu = \bar x = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} x_{ji} \quad \text{and} \quad \hat\sigma^2_{\Theta_0} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \bar x)^2, \tag{20}$$

and so

$$\sup_{\Theta_0} L = \left[\frac{2\pi \sum_j \sum_i (x_{ji} - \bar x)^2}{n}\right]^{-n/2} e^{-n/2}.$$

The generalized likelihood-ratio is then

$$\lambda = \frac{\sup_{\Theta_0} L}{\sup_{\Theta} L} = \left[\frac{\sum_j \sum_i (x_{ji} - \bar x)^2}{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2}\right]^{-n/2} = \left[\frac{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot} + \bar x_{j\cdot} - \bar x)^2}{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2}\right]^{-n/2}$$

$$= \left[\frac{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2 + \sum_j n_j (\bar x_{j\cdot} - \bar x)^2}{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2}\right]^{-n/2} = \left[1 + \frac{k - 1}{n - k} \cdot \frac{\sum_j n_j (\bar x_{j\cdot} - \bar x)^2/(k - 1)}{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2/(n - k)}\right]^{-n/2}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathscr{H}_0$ if and
only if $\lambda \le \lambda_0$. But $\lambda \le \lambda_0$ if and only if

$$r = \frac{\sum_j n_j (\bar x_{j\cdot} - \bar x)^2/(k - 1)}{\sum_j \sum_i (x_{ji} - \bar x_{j\cdot})^2/(n - k)} \ge \text{some constant, say } c. \tag{21}$$

The ratio $r$ is sometimes called the variance ratio, or $F$ ratio. The constant $c$ is
determined so that the test will have size $\alpha$; that is, $c$ is selected so that
$P[R \ge c \mid \mathscr{H}_0] = \alpha$. Note that $\bar X_{j\cdot}$ is independent of $\sum_i (X_{ji} - \bar X_{j\cdot})^2$ and, hence, the
numerator of Eq. (21) is independent of the denominator. Also, under $\mathscr{H}_0$,
note that the sum of squares in the numerator divided by $\sigma^2$ has a chi-square
distribution with $k - 1$ degrees of freedom, and the sum of squares in the denominator
divided by $\sigma^2$ has a chi-square distribution with $n - k$ degrees of freedom.
Consequently, if $\mathscr{H}_0$ is true, $R$ has an
$F$ distribution with $k - 1$ and $n - k$ degrees of freedom; so the constant $c$ is the
$(1 - \alpha)$th quantile of the $F$ distribution with $k - 1$ and $n - k$ degrees of freedom.
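
The variance ratio of Eq. (21) is easy to compute from the between-population and within-population sums of squares. The following sketch uses made-up data and a function name of our own; the critical value $F_{.95}(2, 6) = 5.14$ is read from an $F$ table rather than computed.

```python
def f_ratio(groups):
    """Return (r, lambda): the variance ratio of Eq. (21) and the corresponding likelihood ratio."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n                 # overall mean
    means = [sum(g) / len(g) for g in groups]               # per-population means
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    r = (between / (k - 1)) / (within / (n - k))
    lam = (1 + (k - 1) / (n - k) * r) ** (-n / 2)
    return r, lam

# Hypothetical yields under three treatments, invented for this illustration.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
r, lam = f_ratio(groups)
f_crit = 5.14  # F_{.95}(2, 6), taken from an F table (an input here, not computed)
reject = r >= f_crit
print(r, lam, reject)
```

Large $r$ corresponds to small $\lambda$, so rejecting for $r \ge c$ is the same as rejecting for $\lambda \le \lambda_0$.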

The testing problem considered above is often referred to as a one-way 
analysis of variance. In some experimental situations, an experimenter is 
interested in determining whether or not various possible treatments affect the 
yield. For example, one might be interested in finding out whether various 
types of fertilizer applications affect the yield of a certain crop. The different 
treatments correspond to the different populations, and when we test that there 
is no population difference, we are testing that there is no “treatment” effect. 
The term “analysis of variance” is explained if we note that the denominator of
the ratio in Eq. (21) is an estimate of the variation within populations and the
numerator is an estimate of the variation between populations when means are
equal. We are analyzing variance to test equality of means.


4.4 Tests on Several Variances 


Two variances   Given random samples from each of two normal populations
with means and variances $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, we may test hypotheses about
the two variances. We will consider testing:

(i) $\mathscr{H}_0: \sigma_1^2 \le \sigma_2^2$ versus $\mathscr{H}_1: \sigma_1^2 > \sigma_2^2$

(ii) $\mathscr{H}_0: \sigma_1^2 \ge \sigma_2^2$ versus $\mathscr{H}_1: \sigma_1^2 < \sigma_2^2$

(iii) $\mathscr{H}_0: \sigma_1^2 = \sigma_2^2$ versus $\mathscr{H}_1: \sigma_1^2 \ne \sigma_2^2$

If $X_{11}, \ldots, X_{1n_1}$ is a random sample from a normal density with mean $\mu_1$
and variance $\sigma_1^2$, if $X_{21}, \ldots, X_{2n_2}$ is a random sample from a normal density
with mean $\mu_2$ and variance $\sigma_2^2$, and if the two samples are independent, then we
know that

$$\frac{\sum (X_{1i} - \bar X_1)^2/[(n_1 - 1)\sigma_1^2]}{\sum (X_{2j} - \bar X_2)^2/[(n_2 - 1)\sigma_2^2]}$$

has the $F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom, and, in
particular, the statistic

$$R = \frac{(n_2 - 1) \sum (X_{1i} - \bar X_1)^2}{(n_1 - 1) \sum (X_{2j} - \bar X_2)^2}$$

has the $F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom when $\sigma_1^2 = \sigma_2^2$.
Note that the statistic $R$ tends to be large when $\sigma_1^2 > \sigma_2^2$ and small when $\sigma_1^2 < \sigma_2^2$,
and so we can capitalize on this different behavior to formulate tests for the
hypotheses (i) to (iii). For instance, in testing $\mathscr{H}_0: \sigma_1^2 \le \sigma_2^2$ versus $\mathscr{H}_1: \sigma_1^2 > \sigma_2^2$,
we would reject $\mathscr{H}_0$ for large $R$, or a size-$\alpha$ test is given by the following: Reject
$\mathscr{H}_0$ if and only if $R$ exceeds $F_{1-\alpha}(n_1 - 1, n_2 - 1)$, the $(1 - \alpha)$th quantile of the
$F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. Similarly, a test of
$\mathscr{H}_0: \sigma_1^2 \ge \sigma_2^2$ versus $\mathscr{H}_1: \sigma_1^2 < \sigma_2^2$ is given by the following: Reject $\mathscr{H}_0$ if and
only if $R$ is less than $F_{\alpha}(n_1 - 1, n_2 - 1)$, the $\alpha$th quantile of the $F$ distribution
with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. A test of $\mathscr{H}_0: \sigma_1^2 = \sigma_2^2$ versus
$\mathscr{H}_1: \sigma_1^2 \ne \sigma_2^2$ should be two-tailed; that is, $\mathscr{H}_0$ should be rejected for small or
large $R$. In other words, a test is given by the following: Accept $\mathscr{H}_0$ if and only
if $k_1 < R < k_2$, where $k_1$ and $k_2$ are selected so that the test will have size $\alpha$.
It is customary to make the two tails have equal areas of $\alpha/2$ (although this
is not quite the best test); then $k_1 = F_{\alpha/2}(n_1 - 1, n_2 - 1)$, and $k_2 =
F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$.

We might mention that the above defined tests can all be derived using the
generalized likelihood-ratio principle.
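
A brief sketch of the statistic $R$ for hypothesis (i): the data and function name are our own, and the critical value $F_{.95}(3, 4) = 6.59$ is read from an $F$ table rather than computed.

```python
from statistics import variance

def variance_ratio(sample1, sample2):
    """R = [sum (x1i - xbar1)^2/(n1 - 1)] / [sum (x2j - xbar2)^2/(n2 - 1)], a ratio of sample variances."""
    return variance(sample1) / variance(sample2)

# Hypothetical samples, invented for this illustration.
sample1 = [2.0, 4.0, 6.0, 8.0]        # n1 = 4
sample2 = [1.0, 2.0, 3.0, 4.0, 5.0]   # n2 = 5
r_obs = variance_ratio(sample1, sample2)
f_crit = 6.59  # F_{.95}(3, 4), taken from an F table (an input here, not computed)
reject = r_obs > f_crit  # test (i): reject for large R
print(r_obs, reject)
```

For the two-tailed test (iii) one would compare the same $R$ with both $k_1$ and $k_2$.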


Equality of several variances   Let $X_{j1}, \ldots, X_{jn_j}$ be a random sample of size $n_j$
from a normal population with mean $\mu_j$ and variance $\sigma_j^2$, $j = 1, \ldots, k$. Assume
that the $k$ samples are independent. Our object is to test the null hypothesis
$\mathscr{H}_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ against the alternative that not all variances are equal.
The likelihood function is

$$L(\mu_1, \ldots, \mu_k, \sigma_1^2, \ldots, \sigma_k^2; x_{11}, \ldots, x_{1n_1}, \ldots, x_{k1}, \ldots, x_{kn_k}) = \prod_{j=1}^{k} \prod_{i=1}^{n_j} \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac12[(x_{ji} - \mu_j)/\sigma_j]^2},$$

and the maximum-likelihood estimates of $\mu_j$, $\sigma_j^2$, $j = 1, \ldots, k$ are given by

$$\hat\mu_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ji} = \bar x_{j\cdot}$$

and

$$\hat\sigma_j^2 = \frac{1}{n_j} \sum_{i=1}^{n_j} (x_{ji} - \bar x_{j\cdot})^2.$$

The null hypothesis states that all $\sigma_j^2$ are equal. Let $\sigma^2$ denote their common
value; then $\Theta_0 = \{(\mu_1, \ldots, \mu_k, \sigma^2): -\infty < \mu_j < \infty;\ \sigma^2 > 0\}$, and the
maximum-likelihood estimates of $\mu_1, \ldots, \mu_k, \sigma^2$ over $\Theta_0$ are given by

$$\hat\mu_j = \bar x_{j\cdot}, \quad j = 1, \ldots, k,$$

and

$$\hat\sigma^2 = \frac{1}{\sum n_j} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \bar x_{j\cdot})^2 = \frac{\sum n_j \hat\sigma_j^2}{\sum n_j}.$$

Therefore,

$$\lambda = \frac{\sup_{\Theta_0} L}{\sup_{\Theta} L} = \frac{\left(\dfrac{1}{2\pi\hat\sigma^2}\right)^{\sum n_j/2} \exp\left(-\sum n_j/2\right)}{\prod_{j=1}^{k} \left(\dfrac{1}{2\pi\hat\sigma_j^2}\right)^{n_j/2} \exp\left(-\sum n_j/2\right)} = \prod_{j=1}^{k} \left(\frac{\hat\sigma_j^2}{\hat\sigma^2}\right)^{n_j/2}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathscr{H}_0$ if and
only if $\lambda \le \lambda_0$. We would like to determine the size of the test for any constant
$\lambda_0$ or find $\lambda_0$ so that the test has size $\alpha$, but, unfortunately, the distribution of the
generalized likelihood-ratio is intractable. An approximate size-$\alpha$ test can be
obtained for large $n_j$ since it can be proved that $-2 \log \Lambda$ is approximately
distributed as a chi-square distribution with $k - 1$ degrees of freedom.
According to the generalized likelihood-ratio principle $\mathscr{H}_0$ is to be rejected for small
$\lambda$; hence $\mathscr{H}_0$ should be rejected here for large $-2 \log \lambda$; that is, the critical region
of the approximate test should be the right tail. So the approximate size-$\alpha$ test
is the following: Reject $\mathscr{H}_0$ if and only if $-2 \log \lambda > \chi^2_{1-\alpha}(k - 1)$, the $(1 - \alpha)$th
quantile of the chi-square distribution with $k - 1$ degrees of freedom. (Several
other approximations to the distribution of the likelihood-ratio statistic have
been given, and some exact tests are also available.)
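
Taking logarithms in the likelihood ratio for equal variances gives $-2 \log \lambda = n \log \hat\sigma^2 - \sum n_j \log \hat\sigma_j^2$ with $n = \sum n_j$, which is simple to evaluate. The sketch below uses made-up data and a function name of our own. With $k = 3$ populations the reference distribution has 2 degrees of freedom, for which the chi-square quantile has the closed form $-2 \log \alpha$; other values of $k$ would need a table. The samples here are far too small for the large-sample approximation to be trusted, so the example only illustrates the arithmetic.

```python
import math

def neg2_log_lambda(groups):
    """-2 log(lambda) = n log(pooled ML variance) - sum_j n_j log(ML variance of sample j)."""
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    var_hat = []
    for g in groups:
        m = sum(g) / len(g)
        var_hat.append(sum((x - m) ** 2 for x in g) / len(g))  # divisor n_j, not n_j - 1
    pooled = sum(nj * v for nj, v in zip(sizes, var_hat)) / n
    return n * math.log(pooled) - sum(nj * math.log(v) for nj, v in zip(sizes, var_hat))

# Hypothetical samples from k = 3 populations, invented for this illustration.
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 5.0, 9.0]]
stat = neg2_log_lambda(groups)
alpha = 0.05
crit = -2 * math.log(alpha)  # chi-square (1 - alpha) quantile, valid for 2 degrees of freedom only
reject = stat > crit
print(stat, crit, reject)
```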


5 CHI-SQUARE TESTS 

In this section we present a number of tests of hypotheses that one way or another 
involve the chi-square distribution. Included will be the asymptotic distribution 
of the generalized likelihood-ratio, goodness-of-fit tests, and tests concerning 
contingency tables. The material in this section will be presented with an aim 
of merely finding tests of certain hypotheses, and it will not be presented in such 
a way that concern is given to the optimality of the test. Thus, the power 
functions of the derived tests will not be discussed. 


5.1 Asymptotic Distribution of Generalized Likelihood-ratio 

On two occasions in Sec. 4 we found that the distribution of the generalized 
likelihood-ratio was intractable, and both times we indicated that an approximate 
test could be obtained by using an asymptotic distribution of the generalized 
likelihood-ratio. The following theorem, which we shall not be able to prove 
because of the advanced character of its proof, gives the asymptotic distribution 
of the generalized likelihood-ratio. 


Theorem 7   Let $X_1, \ldots, X_n$ be a sample with joint density
$f_{X_1, \ldots, X_n}(\cdot, \ldots, \cdot\,; \theta)$, where $\theta = (\theta_1, \ldots, \theta_k)$, that is assumed to satisfy quite general
regularity conditions. Suppose that the parameter space $\Theta$ is $k$-dimensional.
In testing the hypothesis

$$\mathscr{H}_0: \theta_1 = \theta_1^0, \ldots, \theta_r = \theta_r^0, \theta_{r+1}, \ldots, \theta_k,$$

where $\theta_1^0, \ldots, \theta_r^0$ are known and $\theta_{r+1}, \ldots, \theta_k$ are left unspecified, $-2 \log \Lambda_n$
is approximately distributed as a chi-square distribution with $r$ degrees of
freedom when $\mathscr{H}_0$ is true and the sample size $n$ is large. ////

We have assumed that $1 \le r \le k$ in the above theorem. If $r = k$, then all
parameters are specified and none is left unspecified. The parameter space $\Theta$ is
$k$-dimensional, and since $\mathscr{H}_0$ specifies the value of $r$ of the components of
$(\theta_1, \ldots, \theta_k)$, the dimension of $\Theta_0$ is $k - r$. Thus, the degrees of freedom of the
asymptotic chi-square distribution in Theorem 7 can be thought of in two ways:
first, as the number of parameters specified by $\mathscr{H}_0$ and, second, as the difference
in the dimensions of $\Theta$ and $\Theta_0$.

Recall that $\Lambda_n$ is the random variable which has values

$$\lambda_n = \sup_{\Theta_0} L(\theta_1, \ldots, \theta_k; x_1, \ldots, x_n) \Big/ \sup_{\Theta} L(\theta_1, \ldots, \theta_k; x_1, \ldots, x_n),$$

which in turn is the generalized likelihood-ratio for a sample of size $n$. $\Theta_0$ is
that subset of $\Theta$ that is specified by $\mathscr{H}_0$. The generalized likelihood-ratio
principle dictates that $\mathscr{H}_0$ is to be rejected for $\lambda_n$ small, but since $-2 \log \lambda_n$ increases
as $\lambda_n$ decreases, a test that is equivalent to a generalized likelihood-ratio test is
one that rejects for $-2 \log \lambda_n$ large. Now, since the theorem gives an
approximate distribution for the values $-2 \log \lambda_n$ when $\mathscr{H}_0$ is true, a test with
approximate size $\alpha$ is given by the following:

Reject $\mathscr{H}_0$ if and only if $-2 \log \lambda_n > \chi^2_{1-\alpha}(r)$,

where $\chi^2_{1-\alpha}(r)$ is the $(1 - \alpha)$th quantile of the chi-square distribution with $r$
degrees of freedom. Note that the degrees of freedom $r$ is the number of
components of the parameter space that are specified by the null hypothesis.

Because of the specific form of the null hypothesis in the theorem, it may
appear that the result is not too widely applicable. The null hypothesis of the
theorem specifies the values of a subset of the $k$ components of the $k$-dimensional
parameter space, and not many null hypotheses are of that form. However,
often the density can be reparameterized so that the null hypothesis is of the
form given in the theorem. We illustrate with two examples.


EXAMPLE 19 Recall that in Subsec. 4.3 we discussed testing $\mathscr{H}_0: \mu_1 = \mu_2$,
$\sigma_1^2 > 0$, $\sigma_2^2 > 0$ versus $\mathscr{H}_1: \mu_1 \ne \mu_2$, $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, where $\mu_1$ and $\sigma_1^2$ are
the mean and variance of one normal population and $\mu_2$ and $\sigma_2^2$ are
the mean and variance of another. Here the parameter space is
four-dimensional, and although $\mathscr{H}_0$ does not appear to be of the form given
in Theorem 7, we can reparameterize to make it of that form. Let
$\theta_1 = \mu_1 - \mu_2$, $\theta_2 = \mu_2$, $\theta_3 = \sigma_1^2$, and $\theta_4 = \sigma_2^2$. In terms of the
reparameterization, $\mathscr{H}_0$ becomes $\mathscr{H}_0: \theta_1 = \theta_1^0 = 0, \theta_2, \theta_3, \theta_4$; that is, the
component $\theta_1$ is specified to be 0, and the remaining three components are
unspecified. The theorem is now applicable for the reparameterization;
that is, the asymptotic distribution of $-2 \log \Lambda'$ is known (and is the
chi-square distribution with one degree of freedom) for $\mathscr{H}_0$ true, where $\Lambda'$ is
the generalized likelihood-ratio obtained under the reparameterization.
However, because of the invariance property of maximum-likelihood
estimators, the generalized likelihood-ratio $\Lambda'$ obtained under the
reparameterization is the same as the generalized likelihood-ratio $\Lambda$ obtained
before reparameterization. ////


EXAMPLE 20 In Subsec. 4.4 we tested $\mathscr{H}_0: \sigma_1^2 = \cdots = \sigma_m^2, \mu_1, \ldots, \mu_m$, where
$\mu_j$ and $\sigma_j^2$ were, respectively, the mean and variance of the $j$th normal
population, $j = 1, \ldots, m$. (In Subsec. 4.4, $k$ was used instead of $m$.) If
we make the following reparameterization, $\mathscr{H}_0$ will have the desired form
of Theorem 7:

$$\theta_1 = \frac{\sigma_1^2}{\sigma_m^2}, \ldots, \theta_{m-1} = \frac{\sigma_{m-1}^2}{\sigma_m^2}, \theta_m = \sigma_m^2, \theta_{m+1} = \mu_1, \ldots, \theta_{2m} = \mu_m.$$

Now $\mathscr{H}_0$ becomes $\mathscr{H}_0: \theta_1 = 1, \ldots, \theta_{m-1} = 1, \theta_m, \theta_{m+1}, \ldots, \theta_{2m}$; that is,
the first $m - 1$ components are specified to be 1 and the remaining are
unspecified. Theorem 7 is now applicable, and, again, because of the
invariance property of maximum-likelihood estimates, the generalized
likelihood-ratio obtained before and after reparameterization are the
same; hence the asymptotic distribution of $-2 \log \Lambda$, as claimed in
Subsec. 4.4, is the chi-square distribution with $m - 1$ degrees of freedom
when $\mathscr{H}_0$ is true. ////


5.2 Chi-square Goodness-of-fit Test 

We commence this section with an example of a testing problem that involves 
the specification of the parameters of a multinomial distribution. It is hoped 
that this example will help motivate the presentation of the goodness-of-fit test. 
If a population has a multinomial density

$$f(x_1, \ldots, x_k; p_1, \ldots, p_k) = \prod_{j=1}^{k+1} p_j^{x_j}, \tag{23}$$

where $x_j = 0$ or 1, $j = 1, \ldots, k + 1$; $0 < p_j < 1$, $j = 1, \ldots, k + 1$; $\sum_{j=1}^{k+1} x_j = 1$;
and $\sum_{j=1}^{k+1} p_j = 1$ (as would be the case in sampling with replacement from a
population of individuals who could be classified into $k + 1$ classes or categories),
a common problem is that of testing whether the probabilities $p_j$ have specified
numerical values. Thus, for instance, the result of casting a die may be classified
into one of six classes, and on the basis of a sample of observations we may wish
to test whether the die is true, that is, whether $p_j = \tfrac16$ for $j = 1, \ldots, 6$. One can
also think in terms of independent, repeated trials, where each trial can result in
any one of $k + 1$ outcomes, called classes or categories. The density in Eq. (23)
then gives the density for the outcome of one trial. The result of one trial can
be represented by the multivariate random variable $(X_1, \ldots, X_k)$, where $X_j$ is
unity if the trial results in category $j$ and is 0 otherwise. $p_j$ is the probability
that a trial results in category $j$. Now if we independently repeat the trial $n$
times, we have $n$ observations of the multivariate random variable $(X_1, \ldots, X_k)$;
we can display them as

$$(X_{11}, \ldots, X_{1k}), (X_{21}, \ldots, X_{2k}), \ldots, (X_{n1}, \ldots, X_{nk}).$$

If we let $N_j = \sum_{i=1}^{n} X_{ij}$, then the random variable $N_j$ is the number of the $n$ trials
resulting in category $j$. We know that $(N_1, \ldots, N_k)$ has a multinomial
distribution. (See Example 5 in Subsec. 2.2 of Chap. IV.)

To test the null hypothesis $\mathscr{H}_0: p_j = p_j^0$, $j = 1, \ldots, k + 1$, where $p_j^0$ are
given probabilities summing to unity, we hope to employ the generalized
likelihood-ratio principle. The likelihood function is given by

$$L = L(p_1, \ldots, p_k; x_{11}, \ldots, x_{1k}, \ldots, x_{n1}, \ldots, x_{nk}) = \prod_{i=1}^{n} \prod_{j=1}^{k+1} p_j^{x_{ij}}. \tag{24}$$

The parameter space $\Theta$ has $k$ dimensions (given $k$ of the $k + 1$ $p_j$'s, the remaining
one is determined by $\sum p_j = 1$), while $\Theta_0$ is a point. It is readily found that $L$ is
maximized in $\Theta$ when

$$\hat p_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} = \frac{n_j}{n},$$

where $n_j$ is a value of the random variable $N_j$. Hence,

$$\sup_{\Theta} L = \prod_{j=1}^{k+1} \left(\frac{n_j}{n}\right)^{n_j}.$$

The maximum of $L$ over $\Theta_0$ is its only value $\prod_{j=1}^{k+1} (p_j^0)^{n_j}$, and so the generalized
likelihood-ratio is

$$\lambda = n^n \prod_{j=1}^{k+1} \left(\frac{p_j^0}{n_j}\right)^{n_j}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathscr{H}_0$ if and
only if $\lambda \le \lambda_0$, where the constant $\lambda_0$ is chosen to give the desired probability of
a Type I error. For small $n$, the distribution of the generalized likelihood-ratio
may be tabulated directly in order to determine $\lambda_0$; for large values of $n$, we may
use Theorem 7, which states that $-2 \log \Lambda$ has approximately the chi-square
distribution with $k$ degrees of freedom. The chi-square approximation is
surprisingly good even if $n$ is small provided that $k > 2$.

Another test which is still commonly used for testing $\mathscr{H}_0$ was proposed (by
Karl Pearson) before the general theory of testing hypotheses was developed.
This test uses the statistic

$$Q_k^0 = \sum_{j=1}^{k+1} \frac{(N_j - np_j^0)^2}{np_j^0}, \qquad (25)$$

which tends to be small when $\mathscr{H}_0$ is true and large when $\mathscr{H}_0$ is false. Note that
$N_j$ is the observed number of trial outcomes resulting in category j and $np_j^0$ is the
expected number when $\mathscr{H}_0$ is true. It can be easily shown (see Prob. 39)
that

$$\mathscr{E}[Q_k^0] = \sum_{j=1}^{k+1} \frac{1}{np_j^0} \left[ np_j(1 - p_j) + n^2 (p_j - p_j^0)^2 \right], \qquad (26)$$

where the $p_j$ are the true parameters. If $\mathscr{H}_0$ is true, then $\mathscr{E}[Q_k^0] = \sum_{j=1}^{k+1} (1 - p_j^0) =
k + 1 - 1 = k$. The following theorem gives a limiting distribution for $Q_k^0$ when
the null hypothesis $\mathscr{H}_0$ is true.
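
Eq. (26) can be evaluated directly as a quick check on the claim that the expectation equals k under the null hypothesis. The following is a minimal sketch; the function name `expected_q` and the loaded-die probabilities are illustrative assumptions, not part of the text.

```python
def expected_q(n, p_true, p_null):
    """E[Q_k^0] from Eq. (26): sum of [n p(1-p) + n^2 (p - p0)^2] / (n p0)."""
    return sum(
        (n * p * (1.0 - p) + n**2 * (p - p0) ** 2) / (n * p0)
        for p, p0 in zip(p_true, p_null)
    )

# Fair six-sided die: k + 1 = 6 classes, so k = 5.
fair = [1.0 / 6.0] * 6
under_null = expected_q(60, fair, fair)

# A hypothetical loaded die (illustrative probabilities only).
loaded = [0.25, 0.15, 0.15, 0.15, 0.15, 0.15]
under_alternative = expected_q(60, loaded, fair)
```

Under the null hypothesis the second term of Eq. (26) vanishes, so the value does not depend on n; under the alternative it grows with n, which is why large values of the statistic count against the null hypothesis.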


Theorem 8 Let the possible outcomes of a certain random experiment
be decomposed into k + 1 mutually exclusive sets, say $A_1, \ldots, A_{k+1}$.
Define $p_j = P[A_j]$, j = 1, ..., k + 1. In n independent repetitions of the
random experiment, let $N_j$ denote the number of outcomes belonging to
set $A_j$, j = 1, ..., k + 1, so that $\sum_{j=1}^{k+1} N_j = n$. Then

$$Q_k = \sum_{j=1}^{k+1} \frac{(N_j - np_j)^2}{np_j} \qquad (27)$$

has as a limiting distribution, as n approaches infinity, the chi-square
distribution with k degrees of freedom. ////

We will not prove the above theorem, but we will indicate its proof for
k = 1. What needs to be demonstrated is that for each argument x, $F_{Q_k}(x)$
converges to $F_{\chi^2(k)}(x)$ as $n \to \infty$, where $F_{Q_k}(\cdot)$ is the cumulative distribution
function of the random quantity $Q_k$ and $F_{\chi^2(k)}(\cdot)$ is the cumulative distribution
function of a chi-square random variable having k degrees of freedom. (Note
that k + 1, the number of groups, is held fixed, and n, the sample size, is
increasing.) If k = 1, then

$$Q_1 = \frac{(N_1 - np_1)^2}{np_1} + \frac{(N_2 - np_2)^2}{np_2}
= \frac{(N_1 - np_1)^2}{np_1} + \frac{(n - N_1 - n + np_1)^2}{n(1 - p_1)}
= \frac{(N_1 - np_1)^2}{np_1(1 - p_1)}.$$

We know that $N_1$ has a binomial distribution with parameters n and $p_1$ and that
$Y_n = (N_1 - np_1)/\sqrt{np_1(1 - p_1)}$ has a limiting standard normal distribution;
hence, since the square of a standard normal random variable has a chi-square
distribution with one degree of freedom, we suspect that $Y_n^2 = Q_1$ has a limiting
chi-square distribution with one degree of freedom, and such can be easily shown
to be the case, which would give a proof of Theorem 8 for k = 1.
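
The algebraic identity used for k = 1, namely that the two-class Pearson statistic equals the square of the standardized binomial count, can be confirmed numerically. This is only a sketch; the function names and the sample values (37 successes in 100 trials with $p_1 = 0.3$) are illustrative assumptions.

```python
import math

def q1(n1, n, p1):
    """Two-class Pearson statistic Q_1 with N_2 = n - N_1 and p_2 = 1 - p_1."""
    n2, p2 = n - n1, 1.0 - p1
    return (n1 - n * p1) ** 2 / (n * p1) + (n2 - n * p2) ** 2 / (n * p2)

def y_squared(n1, n, p1):
    """Square of the standardized binomial count Y_n."""
    y = (n1 - n * p1) / math.sqrt(n * p1 * (1.0 - p1))
    return y * y

# Illustrative values (not from the text): 37 successes in 100 trials, p_1 = 0.3.
lhs = q1(37, 100, 0.3)
rhs = y_squared(37, 100, 0.3)
```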

Theorem 8 gives the limiting distribution for the statistic

$$Q_k^0 = \sum_{j=1}^{k+1} \frac{(N_j - np_j^0)^2}{np_j^0}$$

when the null hypothesis $\mathscr{H}_0: p_j = p_j^0$, j = 1, ..., k + 1, is true. Thus a test of
$\mathscr{H}_0: p_j = p_j^0$, j = 1, ..., k + 1, which has approximate size $\alpha$, is given by the
following:

Reject $\mathscr{H}_0$ if and only if $Q_k^0 > \chi^2_{1-\alpha}(k)$,

the $(1 - \alpha)$th quantile of the chi-square distribution with k degrees of freedom.
We now have two large-sample tests of the null hypothesis $\mathscr{H}_0: p_j = p_j^0$,
j = 1, ..., k + 1, the one just defined, which uses Theorem 8, and the other given
in terms of the generalized likelihood-ratio, which uses Theorem 7. It can, in
fact, be shown that the two tests are equivalent for large samples.

EXAMPLE 21 Mendelian theory indicates that the shape and color of a
certain variety of pea ought to be grouped into four groups, "round and
yellow," "round and green," "angular and yellow," and "angular and
green," according to the ratios 9/3/3/1. For n = 556 peas, the following
were observed (the last column gives the expected number):

                        Observed    Expected
Round and yellow           315       312.75
Round and green            108       104.25
Angular and yellow         101       104.25
Angular and green           32        34.75

A size-.05 test of the null hypothesis $\mathscr{H}_0: p_1 = 9/16$, $p_2 = p_3 = 3/16$,
and $p_4 = 1/16$ is given by the following:

Reject $\mathscr{H}_0$ if and only if $Q_3^0 = \sum_{j=1}^{4} \frac{(N_j - np_j^0)^2}{np_j^0}$ exceeds $\chi^2_{1-\alpha}(k) = \chi^2_{.95}(3) = 7.81$.

The observed $Q_3^0$ is

$$\frac{(315 - 312.75)^2}{312.75} + \frac{(108 - 104.25)^2}{104.25} + \frac{(101 - 104.25)^2}{104.25} + \frac{(32 - 34.75)^2}{34.75} \approx .470,$$

and so there is good agreement with the null hypothesis; that is, there is a
good fit of the data to the model. ////
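
The arithmetic of Example 21 can be reproduced in a few lines, together with the corresponding value of $-2 \log \lambda$ for comparison. This is a sketch rather than a definitive implementation: the variable names are our own, and the critical value 7.81 is simply the figure quoted in the example.

```python
import math

observed = [315, 108, 101, 32]            # counts from Example 21
null_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
n = sum(observed)                          # 556
expected = [n * p for p in null_probs]     # 312.75, 104.25, 104.25, 34.75

# Pearson statistic, Eq. (25).
q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Since lambda = n^n * prod (p_j^0 / n_j)^(n_j),
# -2 log lambda = 2 * sum n_j * log(n_j / (n * p_j^0)).
minus_two_log_lambda = 2.0 * sum(
    o * math.log(o / e) for o, e in zip(observed, expected)
)

critical_value = 7.81                      # chi-square .95 quantile, 3 d.f. (from the text)
reject = q > critical_value
```

For these data the two statistics are nearly equal, in line with the remark that the two tests are equivalent for large samples.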


Theorem 8 can be generalized to the case where the probabilities $p_j$ may
depend on unknown parameters. The generalization is given in the next
theorem.

Theorem 9 Let the possible outcomes of a certain random experiment
be decomposed into k + 1 mutually exclusive sets, say $A_1, \ldots, A_{k+1}$.
Define $p_j = P[A_j]$, j = 1, ..., k + 1, and assume that $p_j$ depends on r
unknown parameters $\theta_1, \ldots, \theta_r$, so that $p_j = \pi_j(\theta_1, \ldots, \theta_r)$, j = 1, ...,
k + 1. In n independent repetitions of the random experiment, let $N_j$
denote the number of outcomes belonging to set $A_j$, j = 1, ..., k + 1, so
that $\sum_{j=1}^{k+1} N_j = n$. Let $\hat{\Theta}_1, \ldots, \hat{\Theta}_r$ be BAN estimators (e.g.,
maximum-likelihood estimators) of $\theta_1, \ldots, \theta_r$ based on $N_1, \ldots, N_k$. Then, under
certain general regularity conditions on the $\pi_j$'s,

$$Q'_k = \sum_{j=1}^{k+1} \frac{(N_j - n\hat{P}_j)^2}{n\hat{P}_j}$$

has a limiting distribution that is the chi-square distribution with k - r
degrees of freedom, where $\hat{P}_j = \pi_j(\hat{\Theta}_1, \ldots, \hat{\Theta}_r)$, j = 1, ..., k + 1. ////

The proof of Theorem 9 is beyond the scope of this book. The limiting
distribution given in Theorem 9 differs from the limiting distribution given in 
Theorem 8 only in the number of degrees of freedom. In Theorem 8 there are k 
degrees of freedom, and in Theorem 9 there are k — r degrees of freedom; the 
number of degrees of freedom has been reduced by one for each parameter that 
is estimated from the data. 

No mention of hypothesis testing is made in the statement of Theorem 9.
However, we will show now how the results of the theorem can be used to obtain
a goodness-of-fit test. Suppose that it is desired to test that a random sample
$X_1, \ldots, X_n$ came from a density $f(x; \theta_1, \ldots, \theta_r)$, where $\theta_1, \ldots, \theta_r$ are unknown
parameters but the function f is known. The null hypothesis is the composite
hypothesis $\mathscr{H}_0$: $X_i$ has density $f(x; \theta_1, \ldots, \theta_r)$ for some $\theta_1, \ldots, \theta_r$. The null
hypothesis states that the random sample came from the parametric family of
densities that is specified by $f(\cdot; \theta_1, \ldots, \theta_r)$. If the range of the random variable
$X_i$ is decomposed into k + 1 subsets, say $A_1, \ldots, A_{k+1}$, if $p_j = P[X_i \in A_j]$, and if
$N_j$ = number of $X_i$'s falling in $A_j$, then, according to Theorem 9,

$$Q'_k = \sum_{j=1}^{k+1} \frac{(N_j - n\hat{P}_j)^2}{n\hat{P}_j}$$

is approximately distributed as the chi-square distribution with k - r degrees of
freedom if n is large and $\mathscr{H}_0$ is true, where $\hat{P}_j = \pi_j(\hat{\Theta}_1, \ldots, \hat{\Theta}_r)$ and $\hat{\Theta}_i$ is a
maximum-likelihood estimator of $\theta_i$, i = 1, ..., r, obtained from the statistics
$N_1, \ldots, N_k$. {Note that $\pi_j(\theta_1, \ldots, \theta_r) = P[X_i \in A_j]$, which for a continuous
random variable $X_i$ equals $\int_{A_j} f(x; \theta_1, \ldots, \theta_r)\,dx$.} Hence, a test of $\mathscr{H}_0$ can be
obtained by rejecting $\mathscr{H}_0$ if and only if the statistic $Q'_k$ is large; that is, reject
$\mathscr{H}_0$ if and only if $Q'_k$ exceeds $\chi^2_{1-\alpha}(k - r)$, where $\chi^2_{1-\alpha}(k - r)$ is the $(1 - \alpha)$th
quantile of the chi-square distribution with k - r degrees of freedom. Such a
test is called a goodness-of-fit test since it tests whether or not the observations
$x_1, \ldots, x_n$ fit, or are consistent with, the assumption that they are observations
from the density $f(x; \theta_1, \ldots, \theta_r)$.

In the above, the $\theta_i$, for i = 1, ..., r, were estimated by using the statistics
$N_1, \ldots, N_k$ rather than $X_1, \ldots, X_n$. The statistics $N_1, \ldots, N_k$ give the number
of x observations falling in each of the $A_j$ subsets or groups. In practice, often
the values of the $X_i$'s are not recorded, and then the group totals $N_1, \ldots, N_k$
constitute the available information. If, however, the observations $X_1, \ldots, X_n$
were available, then one could estimate $\theta_i$, i = 1, ..., r, more efficiently by using,
say, maximum-likelihood estimators based on $X_1, \ldots, X_n$. When such
estimators are used, the limiting distribution of $Q'_k$ is no longer a chi-square
distribution with k - r degrees of freedom; instead, the limiting distribution of $Q'_k$ is
bounded between a chi-square distribution with k - r degrees of freedom and a
chi-square distribution with k degrees of freedom. In a sense, some of the
"lost" r degrees of freedom are recouped by efficiently estimating $\theta_1, \ldots, \theta_r$.
For a proof of Theorem 9 and further discussion of the above, the reader is
referred to Kendall and Stuart [14].


EXAMPLE 22 Suppose it is desired to test the hypothesis that an observed
random sample $x_1, \ldots, x_n$ has been drawn from some normal population.
Let the n sample values $x_1, \ldots, x_n$ be grouped into k + 1 classes. For
example, the jth class could be taken as all those observations falling in the
interval $(z_{j-1}, z_j]$, j = 1, ..., k + 1, for some $z_0 < z_1 < z_2 < \cdots < z_k
< z_{k+1}$, where $z_0 = -\infty$ and $z_{k+1} = +\infty$. Then

$$p_j = \pi_j(\mu, \sigma^2) = \int_{z_{j-1}}^{z_j} \phi_{\mu,\sigma^2}(x)\,dx = \Phi\left(\frac{z_j - \mu}{\sigma}\right) - \Phi\left(\frac{z_{j-1} - \mu}{\sigma}\right).$$

Let $\hat{\mu}$ and $\hat{\sigma}$ be the maximum-likelihood estimates of $\mu$ and $\sigma$ based on
$n_1, \ldots, n_k$, where $n_j$ is the number of observations falling in the jth interval.
Then,

$$\hat{p}_j = \Phi\left(\frac{z_j - \hat{\mu}}{\hat{\sigma}}\right) - \Phi\left(\frac{z_{j-1} - \hat{\mu}}{\hat{\sigma}}\right)$$

can be determined from the sample, and so the value

$$q'_k = \sum_{j=1}^{k+1} \frac{(n_j - n\hat{p}_j)^2}{n\hat{p}_j}$$

of $Q'_k$ can also be obtained from the sample. The hypothesis that the
sample came from a normal population would be rejected at the $\alpha$ level if
$q'_k > \chi^2_{1-\alpha}(k - 2)$. If, on the other hand, $\hat{\mu}$ and $\hat{\sigma}$ were obtained from
maximum-likelihood estimators based on $X_1, \ldots, X_n$, then the asymptotic
distribution of $Q'_k$ would fall between a chi-square distribution with k - 2
degrees of freedom and a chi-square distribution with k degrees of freedom.
The hypothesis would be rejected if $q'_k > C$, where C falls between
$\chi^2_{1-\alpha}(k - 2)$ and $\chi^2_{1-\alpha}(k)$. Note that for k large there is little difference
between $\chi^2_{1-\alpha}(k - 2)$ and $\chi^2_{1-\alpha}(k)$. ////
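
The cell probabilities and the statistic of Example 22 are straightforward to compute once estimates are in hand. The sketch below makes several assumptions that are not in the text: the cut points and counts are invented for illustration, and `mu_hat` and `sigma_hat` are stand-ins for the grouped-data maximum-likelihood estimates, which are not computed here.

```python
import math

def normal_cdf(x, mu, sigma):
    """Phi((x - mu) / sigma) via the error function; handles +/- infinity."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cell_probs(cuts, mu, sigma):
    """p_j = Phi((z_j - mu)/sigma) - Phi((z_{j-1} - mu)/sigma), z_0 = -inf, z_{k+1} = +inf."""
    z = [-math.inf] + list(cuts) + [math.inf]
    return [
        normal_cdf(z[j], mu, sigma) - normal_cdf(z[j - 1], mu, sigma)
        for j in range(1, len(z))
    ]

def pearson(counts, probs):
    """Sum of (n_j - n p_j)^2 / (n p_j) over the k + 1 classes."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

# Hypothetical grouped data: k + 1 = 5 classes with cut points z_1, ..., z_4.
cuts = [-1.0, 0.0, 1.0, 2.0]
counts = [18, 30, 33, 15, 4]
mu_hat, sigma_hat = 0.1, 1.0   # assumed stand-ins for the grouped-data MLEs

probs = cell_probs(cuts, mu_hat, sigma_hat)
q_prime = pearson(counts, probs)
degrees_of_freedom = len(counts) - 1 - 2   # k - r with k = 4 and r = 2
```

The value `q_prime` would then be compared with the $(1 - \alpha)$th quantile of the chi-square distribution with `degrees_of_freedom` degrees of freedom.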


5.3 Test of the Equality of Two Multinomial 
Distributions and Generalizations 

A problem that is of great practical importance is that of testing whether several 
random samples can be considered as drawn from the same population. For 
instance, in Subsec. 4.3 we tested whether several assumed normal populations 
could be considered the same normal population. In this subsection we first
indicate a test of the hypothesis that two multinomial populations can be
considered the same and then indicate some generalizations. Suppose that there are
k + 1 groups associated with each of the two multinomial populations. Let the
first population have associated probabilities $p_{11}, p_{12}, \ldots, p_{1k}, p_{1,k+1}$ and the
second $p_{21}, p_{22}, \ldots, p_{2k}, p_{2,k+1}$. It is desired to test $\mathscr{H}_0: p_{1j} = p_{2j}$ ($= p_j$, say),
j = 1, ..., k + 1. For a sample of size $n_1$ from the first population, let $N_{1j}$
denote the number of outcomes in group j, j = 1, ..., k + 1. Similarly, let $N_{2j}$
denote the number of outcomes in group j of a sample of size $n_2$ from the second
population. (Here we are assuming that the sample sizes $n_1$ and $n_2$ are known.)
We know that

$$\sum_{j=1}^{k+1} \frac{(N_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$$

has a limiting chi-square distribution with k degrees of freedom for i = 1 and 2;
hence

$$\sum_{i=1}^{2} \sum_{j=1}^{k+1} \frac{(N_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$$

has a limiting chi-square distribution with 2k degrees of freedom if the two
random samples are independent. If $\mathscr{H}_0$ is true, then

$$Q_{2k} = \sum_{i=1}^{2} \sum_{j=1}^{k+1} \frac{(N_{ij} - n_i p_j)^2}{n_i p_j} \qquad (29)$$

has a limiting chi-square distribution with 2k degrees of freedom. If $\mathscr{H}_0$
specifies the values $p_j$, then $Q_{2k}$ is a statistic and can be used as a test statistic.
On the other hand, if the $p_j$ defined by $\mathscr{H}_0$ are unknown, then they have to be
estimated. If $\mathscr{H}_0$ is true, the two samples can be considered as one random
sample of size $n_1 + n_2$ from a multinomial population with probabilities $p_1, \ldots,
p_{k+1}$. Maximum-likelihood estimators of the $p_j$ are then $(N_{1j} + N_{2j})/(n_1 + n_2)$,
j = 1, ..., k, and if the $p_j$ in Eq. (29) are replaced by their maximum-likelihood
estimators, we then obtain

$$Q'_{2k} = \sum_{i=1}^{2} \sum_{j=1}^{k+1} \frac{[N_{ij} - n_i (N_{1j} + N_{2j})/(n_1 + n_2)]^2}{n_i (N_{1j} + N_{2j})/(n_1 + n_2)}. \qquad (30)$$

It can be shown that $Q'_{2k}$ has a limiting chi-square distribution with 2k - k = k
degrees of freedom. (This result is not a direct corollary of Theorem 9; it
would, however, be a corollary of a generalization of Theorem 9 from one to
two populations.) Again the degrees of freedom of the limiting distribution of
$Q'_{2k}$ have been reduced by unity for each parameter estimated.

Another test of the homogeneity of two multinomial populations can be
derived by finding the generalized likelihood-ratio $\Lambda$ and employing Theorem 7
to obtain the limiting distribution of $-2 \log \Lambda$. (Reparameterization is required
before Theorem 7 can be employed directly.) The details of finding such a test
are left as an exercise.


EXAMPLE 23 In an opinion survey regarding a certain political issue there
was some question as to whether or not the eligible voters under 25 years
of age might view the issue differently from those over 25. Fifteen hundred
individuals of those over 25 were interviewed, and 1000 of those under 25
were interviewed with the following results (the data are obviously
artificial to facilitate calculations):

              Opposed    Undecided    Favor    Total
Under 25        400         100        500      1000
Over 25         600         400        500      1500
Total          1000         500       1000      2500

Test the null hypothesis that there is no evidence of difference of opinion
due to the different age grouping; that is, test $\mathscr{H}_0: p_{1j} = p_{2j} = p_j$, j = 1,
2, 3. $p_1$ and $p_2$ need to be estimated. We can calculate the value of the
statistic given in Eq. (30) as follows:

$$\begin{aligned}
&\frac{(400 - 1000 \cdot 1000/2500)^2}{1000 \cdot 1000/2500} + \frac{(100 - 1000 \cdot 500/2500)^2}{1000 \cdot 500/2500} + \frac{(500 - 1000 \cdot 1000/2500)^2}{1000 \cdot 1000/2500} \\
&\quad + \frac{(600 - 1500 \cdot 1000/2500)^2}{1500 \cdot 1000/2500} + \frac{(400 - 1500 \cdot 500/2500)^2}{1500 \cdot 500/2500} + \frac{(500 - 1500 \cdot 1000/2500)^2}{1500 \cdot 1000/2500} = 125.
\end{aligned}$$

The 99 percent quantile point for the chi-square distribution with two
degrees of freedom is only 9.21; so there is strong evidence that the two age
groups have different opinions on the political issue. ////
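
The computation in Example 23 generalizes directly to any number of groups. The following sketch evaluates Eq. (30) from the table of counts; the function name `homogeneity_statistic` is our own, and the critical value 9.21 is the figure quoted in the example.

```python
def homogeneity_statistic(rows):
    """Eq. (30): pooled column proportions estimate the common p_j."""
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    q = 0.0
    for row in rows:
        n_i = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = n_i * col_total / grand_total
            q += (observed - expected) ** 2 / expected
    return q

# Example 23: columns are Opposed, Undecided, Favor.
under_25 = [400, 100, 500]
over_25 = [600, 400, 500]
q = homogeneity_statistic([under_25, over_25])
critical_value = 9.21   # chi-square .99 quantile, 2 d.f. (from the text)
```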

The technique presented in this subsection can be generalized in two
directions. First, a test of the homogeneity of several, rather than just two,
multinomial populations can be obtained, and, second, a test of the hypothesis that
several given samples are drawn from the same population of a specified type
(such as the Poisson, the gamma, etc.) can be obtained using a procedure similar
to that above. We illustrate with an example.


EXAMPLE 24 One hundred observations were drawn from each of two Poisson
populations with the following results:

                 0     1     2     3     4     5     6     7     8    9 or more    Total
Population 1    11    25    28    20     9     3     3     0     1        0         100
Population 2    13    27    28    17    11     1     2     1     0        0         100
Total           24    52    56    37    20    11 (5 or more, combined)              200

Is there strong evidence in the data to support the contention that the two
Poisson populations are different? That is, test the hypothesis that the
two populations are the same. This hypothesis can be tested in a variety
of ways. We first use the chi-square technique mentioned above. We
group the data into six groups, the last including all digits greater than 4, as
indicated in the above table. If the two populations are the same, we have
to estimate one parameter, namely, the mean of the common Poisson
distribution. The maximum-likelihood estimate is the sample mean,
which is

$$\frac{0(24) + 1(52) + 2(56) + 3(37) + 4(20) + 5(4) + 6(5) + 7(1) + 8(1)}{200} = \frac{420}{200} = 2.1.$$

The expected number in each group of each population is given by

    0        1        2        3        4     5 or more
  12.25    25.72    27.00    18.90     9.92     6.21

The value of the statistic in Eq. (29), where $n_i p_j$ is replaced by the estimates
given in the above table, can be calculated. It is approximately 1.68.
The degrees of freedom should be 2k - 1 (one parameter is estimated),
which is 9. The test indicates that there is no reason to suspect that the
two assumed Poisson populations are different Poisson populations. ////

We mentioned earlier that there are several methods of testing the null
hypothesis considered here. For example the generalized likelihood-ratio
principle and employment of Theorem 7 yield a test that the student may find
instructive to find for himself.
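
The figures in Example 24 can be reproduced as follows. This is a sketch under the example's own setup: the counts are those of the table, the six groups are 0, 1, 2, 3, 4, and "5 or more," and the helper names `grouped` and `expected_counts` are our own.

```python
import math

pop1 = [11, 25, 28, 20, 9, 3, 3, 0, 1, 0]   # counts for values 0..8 and "9 or more"
pop2 = [13, 27, 28, 17, 11, 1, 2, 1, 0, 0]

# Pooled sample mean: maximum-likelihood estimate of the common Poisson mean.
# The "9 or more" cell is empty in both samples, so treating it as 9 is harmless.
total_obs = sum(pop1) + sum(pop2)
mean_hat = sum(v * (a + b) for v, (a, b) in enumerate(zip(pop1, pop2))) / total_obs

def grouped(counts):
    """Collapse to six groups: 0, 1, 2, 3, 4, and 5 or more."""
    return counts[:5] + [sum(counts[5:])]

def expected_counts(n, mean):
    """Expected numbers n * P[X = v] for v = 0..4 and n * P[X >= 5]."""
    probs = [math.exp(-mean) * mean**v / math.factorial(v) for v in range(5)]
    probs.append(1.0 - sum(probs))
    return [n * p for p in probs]

expected = expected_counts(100, mean_hat)
q = sum(
    (o - e) ** 2 / e
    for counts in (pop1, pop2)
    for o, e in zip(grouped(counts), expected)
)
```

Using unrounded expected numbers changes the statistic only in the third decimal place relative to the value quoted in the example.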


5.4 Tests of Independence in Contingency Tables 

A contingency table is a multiple classification; for example, in a public opinion 
survey the individuals interviewed may be classified according to their attitude 
on a political proposal and according to sex to obtain a table of the form 



            Favor    Oppose    Undecided
Men          1154      475        243
Women        1083      442        362


This is a 2 x 3 contingency table. The individuals are classified by two criteria, 
one having two categories and the other three categories. The six distinct 
classifications are called cells. A three-way contingency table would have been 
obtained had the individuals been further classified according to a third criterion, 
say, according to an annual-income group. If there were five income groups 
set up (such as under $2000, $2000 to $4000,...), the contingency table would be 
called a 2 x 3 x 5 table and would have 30 cells into which a person might be 
put. It is often quite convenient to think of the cells as cubes in a block two 
units wide, three units long, and five units deep. If the individuals were still 
further classified into eight geographic locations, one would have a four-way 
(2 x 3 x 5 x 8) contingency table with 240 cells in a four-dimensional block 
with edges two, three, five, and eight units long. A contingency table provides 
a convenient display of the data for ultimately investigating suspected
relationships. Thus one may suspect that men and women will react differently to a
certain political proposal, in which case one would construct such a table as the
one above and test the null hypothesis that their attitudes were independent of
their sex. To consider another example, a geneticist may suspect that
susceptibility to a certain disease is heritable. He would classify a sample of individuals
according to (i) whether or not they ever had the disease, (ii) whether or not their
fathers had the disease, and (iii) whether or not their mothers had the disease.
In the resulting 2 x 2 x 2 contingency table he would test the null hypothesis
that classification (i) was independent of (ii) and (iii). Again a medical research
worker might suspect a certain environmental condition favored a given disease
and classify individuals according to (i) whether or not they ever had the disease,
(ii) whether or not they were subject to the condition. An industrial engineer
could use a contingency table to discover whether or not two kinds of defects in 
a manufactured product were due to the same underlying cause or to different 
causes. It is apparent that the technique can be a very useful tool in any field 
of research. 

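Two-way contingency tables We shall suppose that n individuals or items are
classified according to two criteria A and B, that there are r classifications
$A_1, A_2, \ldots, A_r$ in A and s classifications $B_1, B_2, \ldots, B_s$ in B, and that the number
of individuals belonging to $A_i$ and $B_j$ is $N_{ij}$. We have then an r x s contingency
table with cell frequencies $N_{ij}$ and $\sum N_{ij} = n$:

$$\begin{array}{c|cccc}
 & B_1 & B_2 & \cdots & B_s \\ \hline
A_1 & N_{11} & N_{12} & \cdots & N_{1s} \\
A_2 & N_{21} & N_{22} & \cdots & N_{2s} \\
\vdots & \vdots & \vdots & & \vdots \\
A_r & N_{r1} & N_{r2} & \cdots & N_{rs}
\end{array} \qquad (31)$$

As a further notation we shall denote the row totals by $N_{i.}$ and the column totals
by $N_{.j}$; that is,

$$N_{i.} = \sum_{j} N_{ij} \quad \text{and} \quad N_{.j} = \sum_{i} N_{ij}.$$

Of course,

$$\sum_{i} N_{i.} = \sum_{j} N_{.j} = n.$$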

We shall now set up a probability model for the problem with which we
wish to deal. The n individuals will be regarded as a sample of size n from a
multinomial population with probabilities $p_{ij}$ (i = 1, 2, ..., r; j = 1, 2, ..., s).
The probability density function for a single observation is

$$f(x_{11}, x_{12}, \ldots, x_{rs}; p_{11}, \ldots, p_{rs}) = \prod_{i,j} p_{ij}^{x_{ij}}, \qquad (32)$$

where $x_{ij} = 0$ or 1 and $\sum_{i,j} x_{ij} = 1$.

We wish to test the null hypothesis that the A and B classifications are
independent, i.e., that the probability that an individual falls in $B_j$ is not affected by the
A class to which the individual happens to belong. Using the symbolism of
Chap. I, we would write

$$P[B_j \mid A_i] = P[B_j] \quad \text{and} \quad P[A_i \mid B_j] = P[A_i]$$

or

$$P[A_i \cap B_j] = P[A_i]P[B_j].$$

If we denote the marginal probabilities $P[A_i]$ by $p_{i.}$ (i = 1, 2, ..., r) and the
marginal probabilities $P[B_j]$ by $p_{.j}$ (j = 1, 2, ..., s), the null hypothesis is simply

$$\mathscr{H}_0: p_{ij} = p_{i.}p_{.j}, \quad \sum_{i} p_{i.} = 1, \quad \sum_{j} p_{.j} = 1. \qquad (33)$$

When the null hypothesis is not true, there is said to be interaction between the
two criteria of classification.

The complete parameter space $\Theta$ for the distribution of $N_{11}, \ldots, N_{rs}$ has
rs - 1 dimensions (having specified all but one of the $p_{ij}$, the remaining one is
fixed by $\sum_{i,j} p_{ij} = 1$), while under $\mathscr{H}_0$ we have a parameter space $\Theta_0$ with
r - 1 + s - 1 dimensions. (The null hypothesis is specified by $p_{i.}$, i = 1, ..., r,
and $p_{.j}$, j = 1, ..., s, but there are only r - 1 + s - 1 dimensions because
$\sum p_{i.} = 1$ and $\sum p_{.j} = 1$.) The likelihood for a sample of size n is

$$L = \prod_{i,j} p_{ij}^{n_{ij}},$$

and its maximum in $\Theta$ occurs when

$$\hat{p}_{ij} = \frac{n_{ij}}{n}.$$

In $\Theta_0$,

$$L = \prod_{i,j} (p_{i.}p_{.j})^{n_{ij}} = \Bigl(\prod_{i} p_{i.}^{n_{i.}}\Bigr)\Bigl(\prod_{j} p_{.j}^{n_{.j}}\Bigr), \qquad (34)$$

and its maximum occurs at

$$\hat{p}_{i.} = \frac{n_{i.}}{n} \quad \text{and} \quad \hat{p}_{.j} = \frac{n_{.j}}{n}. \qquad (35)$$

The generalized likelihood-ratio is therefore

$$\lambda = \frac{\bigl(\prod_{i} n_{i.}^{n_{i.}}\bigr)\bigl(\prod_{j} n_{.j}^{n_{.j}}\bigr)}{n^n \prod_{i,j} n_{ij}^{n_{ij}}}. \qquad (36)$$

The distribution of $\Lambda$ under the null hypothesis is not unique because the
hypothesis is composite and the exact distribution of $\Lambda$ does involve the unknown
parameters $p_{i.}$ and $p_{.j}$; hence, it is very difficult to solve for $\lambda_0$ in
$\sup_{\Theta_0} P[\Lambda \le \lambda_0] = \alpha$. For large samples we do have a test, however, because
$-2 \log \Lambda$ is in that case approximately distributed as a chi-square random variable with

rs - 1 - (r + s - 2) = (r - 1)(s - 1)

degrees of freedom and on the basis of this distribution a unique critical region
for $\lambda$ may be determined. The degrees of freedom rs - 1 - (r + s - 2) is
obtained by subtracting r + s - 2, which is the dimension of $\Theta_0$, from rs - 1,
which is the dimension of $\Theta$. Also, (r - 1)(s - 1) is the number of parameters
specified by $\mathscr{H}_0$. (See Theorem 7 and the comment following it.) Actually,
the null hypothesis $\mathscr{H}_0: p_{ij} = p_{i.}p_{.j}$ is not of the form required by Theorem 7;
so it might be instructive to consider the necessary reparameterization. For
convenience, let us take r = s = 2. Now $\Theta = \{(\theta_1, \theta_2, \theta_3) = (p_{11}, p_{12}, p_{21})$:
$p_{11} \ge 0$; $p_{12} \ge 0$; $p_{21} \ge 0$; and $p_{11} + p_{12} + p_{21} \le 1\}$. Let $\Theta'$ with points
$(\theta'_1, \theta'_2, \theta'_3)$ denote the reparameterized space, where $\theta'_1 = p_{11} - p_{1.}p_{.1}$, $\theta'_2 = p_{1.}$,
and $\theta'_3 = p_{.1}$. It can be easily demonstrated that $\Theta'$ is a one-to-one
transformation of $\Theta$. Also, the null hypothesis $\mathscr{H}_0: p_{11} = p_{1.}p_{.1}$, $p_{12} = p_{1.}(1 - p_{.1})$, and
$p_{21} = (1 - p_{1.})p_{.1}$ in the original parameter space $\Theta$ becomes $\mathscr{H}'_0: \theta'_1 = 0$ and
$\theta'_2$ and $\theta'_3$ unspecified in the reparameterized space $\Theta'$. [Note that $p_{12} =
p_{1.}(1 - p_{.1})$ is equivalent to $p_{1.} - p_{11} = p_{1.}(1 - p_{.1})$, which is equivalent to $p_{11} -
p_{1.}p_{.1} = 0$. Similarly for $p_{21} = (1 - p_{1.})p_{.1}$.] $\mathscr{H}'_0$ is of the form required by
Theorem 7. In general, a point in the (rs - 1)-dimensional parameter space $\Theta$
can be conveniently displayed as


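$$\begin{array}{ccccc}
p_{11} & p_{12} & \cdots & p_{1,s-1} & p_{1s} \\
p_{21} & p_{22} & \cdots & p_{2,s-1} & p_{2s} \\
\vdots & \vdots & & \vdots & \vdots \\
p_{r-1,1} & p_{r-1,2} & \cdots & p_{r-1,s-1} & p_{r-1,s} \\
p_{r1} & p_{r2} & \cdots & p_{r,s-1} &
\end{array}$$

and a point in the reparameterized space $\Theta'$ can be displayed as

$$\begin{array}{ccccc}
p_{11} - p_{1.}p_{.1} & p_{12} - p_{1.}p_{.2} & \cdots & p_{1,s-1} - p_{1.}p_{.,s-1} & p_{1.} \\
p_{21} - p_{2.}p_{.1} & p_{22} - p_{2.}p_{.2} & \cdots & p_{2,s-1} - p_{2.}p_{.,s-1} & p_{2.} \\
\vdots & \vdots & & \vdots & \vdots \\
p_{r-1,1} - p_{r-1,.}p_{.1} & p_{r-1,2} - p_{r-1,.}p_{.2} & \cdots & p_{r-1,s-1} - p_{r-1,.}p_{.,s-1} & p_{r-1,.} \\
p_{.1} & p_{.2} & \cdots & p_{.,s-1} &
\end{array}$$

In casting about for a test which may be used when the sample is not large,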
we may inquire how it is that a test criterion comes to have a unique distribution 
for large samples when the distribution actually depends on unknown param¬ 
eters which may have any values in certain ranges. The answer is that the 
parameters are not really unknown; they can be estimated, and their estimates 
approach their true values as the sample size increases. In the limit as n becomes 
infinite, the parameters are known exactly, and it is at that point that the
distribution of $\Lambda$ actually becomes unique. It is unique because a particular point
in $\Theta_0$ is selected as the true parameter point, so that the $N_{ij}$ are given a unique
distribution, and the distribution of $\Lambda$ is then determined by this distribution.

It would appear reasonable to employ a similar procedure to set up a test
for small samples, i.e., to define a distribution for $\Lambda$ by using the estimates for
the unknown parameters. In the present problem, since the estimates of the
$p_{i.}$ and $p_{.j}$ are given by Eq. (35), we might just substitute those values in the
distribution function of the $N_{ij}$ and use the distribution to obtain a distribution
for $\Lambda$. However, we should still be in trouble; the critical region would depend
on the marginal totals $N_{i.}$ and $N_{.j}$; hence the probability of a Type I error would
vary from sample to sample for any fixed critical region $0 < \lambda < \lambda_0$.

There is a way out of this difficulty, which is well worth investigation
because of its own interest and because the problem is important in applied
statistics. Let us denote the joint density of all the $N_{ij}$ briefly by $f(n_{ij})$, the
marginal density of all the $N_{i.}$ and $N_{.j}$ by $g(n_{i.}, n_{.j})$, and the conditional density
of the $N_{ij}$, given the marginal totals, by

$$f(n_{ij} \mid n_{i.}, n_{.j}) = \frac{f(n_{ij})}{g(n_{i.}, n_{.j})}.$$

Under the null hypothesis, this conditional distribution happens to be
independent of the unknown parameters (as we shall show presently); the estimators
$N_{i.}/n$ and $N_{.j}/n$ form a sufficient set of statistics for the $p_{i.}$ and $p_{.j}$. This fact
will enable us to construct a test.

The joint density of the $N_{ij}$ is simply the multinomial distribution

$$f(n_{11}, n_{12}, \ldots, n_{rs}) = \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i,j} p_{ij}^{n_{ij}} \qquad (37)$$

in $\Theta$, and in $\Theta_0$ (we are interested in the distribution of $\Lambda$ under $\mathscr{H}_0$) this becomes

$$f(n_{11}, n_{12}, \ldots, n_{rs}) = \frac{n!}{\prod_{i,j} n_{ij}!} \Bigl(\prod_{i} p_{i.}^{n_{i.}}\Bigr)\Bigl(\prod_{j} p_{.j}^{n_{.j}}\Bigr). \qquad (38)$$

To obtain the desired conditional distribution, we must first find the distribution
of the $N_{i.}$ and $N_{.j}$, and this is accomplished by summing Eq. (38) over all sets
of $n_{ij}$ such that

$$\sum_{i} n_{ij} = n_{.j} \quad \text{and} \quad \sum_{j} n_{ij} = n_{i.}. \qquad (39)$$

For fixed marginal totals, only the factor $1/\prod_{i,j} n_{ij}!$ in Eq. (38) is involved in
the sum; so we have, in effect, to sum that factor over all $n_{ij}$ subject to Eq. (39).
The desired sum is given by comparing the coefficients of $\prod_{i} x_i^{n_{i.}}$ in the expression

$$(x_1 + \cdots + x_r)^{n_{.1}} (x_1 + \cdots + x_r)^{n_{.2}} \cdots (x_1 + \cdots + x_r)^{n_{.s}} = (x_1 + \cdots + x_r)^n. \qquad (40)$$

On the right-hand side the coefficient of $\prod_{i} x_i^{n_{i.}}$ is simply

$$\frac{n!}{\prod_{i} n_{i.}!}. \qquad (41)$$

On the left-hand side there are terms with coefficients of the form

$$\frac{n_{.1}!}{\prod_{i} n_{i1}!} \cdot \frac{n_{.2}!}{\prod_{i} n_{i2}!} \cdots \frac{n_{.s}!}{\prod_{i} n_{is}!} = \frac{\prod_{j} n_{.j}!}{\prod_{i,j} n_{ij}!}, \qquad (42)$$

where $n_{ij}$ is the exponent of $x_i$ in the jth multinomial. In this expression the
$n_{ij}$ satisfy conditions of Eq. (39); the first condition is satisfied in view of the
multinomial theorem, while the second is satisfied because we require the
exponent of $x_i$ in these terms to be $n_{i.}$. The sum of all such coefficients, Eq. (42),
must equal Eq. (41); hence, we may write

$$\sum \frac{1}{\prod_{i,j} n_{ij}!} = \frac{n!}{\bigl(\prod_{i} n_{i.}!\bigr)\bigl(\prod_{j} n_{.j}!\bigr)}. \qquad (43)$$


This is precisely the sum that we require because there is obviously one and only
one coefficient of the form of Eq. (42) on the left of Eq. (40) for every possible
contingency table, Eq. (31), with given marginal totals. The distribution of the
$N_{i.}$ and $N_{.j}$ is, therefore,

$$g(n_{i.}, n_{.j}) = \frac{(n!)^2}{\bigl(\prod_{i} n_{i.}!\bigr)\bigl(\prod_{j} n_{.j}!\bigr)} \Bigl(\prod_{i} p_{i.}^{n_{i.}}\Bigr)\Bigl(\prod_{j} p_{.j}^{n_{.j}}\Bigr), \qquad (44)$$

which shows, incidentally, that the $N_{i.}$ are distributed independently of the $N_{.j}$
under $\mathscr{H}_0$; this is unexpected because $N_{1.}$ and $N_{.1}$, for example, have the
random variable $N_{11}$ in common!

The conditional distribution of the $N_{ij}$, given the marginal totals, is
obtained by dividing Eq. (38) by Eq. (44) to obtain

$$f(n_{11}, n_{12}, \ldots, n_{rs} \mid n_{1.}, n_{2.}, \ldots, n_{.s}) = \frac{\bigl(\prod_{i} n_{i.}!\bigr)\bigl(\prod_{j} n_{.j}!\bigr)}{n! \prod_{i,j} n_{ij}!}, \qquad (45)$$

which, happily, does not involve the unknown parameters and shows that the
estimators are sufficient.
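
As a check on Eqs. (43) and (45), the conditional probabilities can be summed over every table with given marginal totals; the sum should be exactly one. The sketch below does this for the 2 x 2 case using exact rational arithmetic. The function names and the marginal totals (5, 7) and (4, 8) are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def conditional_prob(table):
    """Eq. (45): prod n_i.! * prod n_.j! / (n! * prod n_ij!), as an exact fraction."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    num = 1
    for t in row_totals + col_totals:
        num *= factorial(t)
    den = factorial(sum(row_totals))
    for r in table:
        for cell in r:
            den *= factorial(cell)
    return Fraction(num, den)

def tables_2x2(row_totals, col_totals):
    """All 2 x 2 tables with the given margins, indexed by the (1,1) cell."""
    (r1, r2), (c1, _) = row_totals, col_totals
    for n11 in range(max(0, c1 - r2), min(r1, c1) + 1):
        yield [[n11, r1 - n11], [c1 - n11, r2 - c1 + n11]]

# Illustrative margins (not from the text).
total = sum(conditional_prob(t) for t in tables_2x2((5, 7), (4, 8)))
```

For a 2 x 2 table, Eq. (45) reduces to the hypergeometric distribution of the (1,1) cell, which is why the sum is one.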

To see how a test may be constructed, let us consider the general situation
in which a test statistic $\Lambda$ for some test has a distribution $f_\Lambda(\lambda; \theta)$ which involves
an unknown parameter $\theta$. If $\theta$ has a sufficient statistic, say T, then the joint
density of $\Lambda$ and T may be written

$$f_{\Lambda,T}(\lambda, t; \theta) = f_{\Lambda|T}(\lambda \mid t) f_T(t; \theta),$$

and the conditional density of $\Lambda$, given T, will not involve $\theta$. Using the
conditional distribution, we may find a number, say $\lambda_0(t)$, for every t such that

$$\int_0^{\lambda_0(t)} f_{\Lambda|T}(\lambda \mid t)\,d\lambda = .05, \qquad (46)$$

for example. In the $\lambda t$-plane the curve $\lambda = \lambda_0(t)$ together with the line $\lambda = 0$
will determine a region R. See Fig. 7. The probability that a sample will give
rise to a pair of values $(\lambda, t)$ which correspond to a point in R is exactly .05
because

$$\begin{aligned}
\int_{-\infty}^{\infty} \int_0^{\lambda_0(t)} f_{\Lambda,T}(\lambda, t; \theta)\,d\lambda\,dt
&= \int_{-\infty}^{\infty} \left[\int_0^{\lambda_0(t)} f_{\Lambda|T}(\lambda \mid t)\,d\lambda\right] f_T(t; \theta)\,dt \\
&= \int_{-\infty}^{\infty} .05\, f_T(t; \theta)\,dt \\
&= .05.
\end{aligned}$$

Hence we may test the hypothesis by using T in conjunction with $\Lambda$. The
critical region is a plane region instead of an interval $0 < \lambda < \lambda_0$; it is such a
region that, whatever the unknown value of $\theta$ may be, the Type I error has a
specified probability. The test in any given situation actually amounts to a
conditional test; we observe T and then perform the test by using the interval
$0 < \lambda < \lambda_0(t)$ using the conditional distribution of $\Lambda$, given T. It is to be observed
that this device cannot be employed unless there is a sufficient statistic for $\theta$.

The above technique is obviously applicable when $\theta$ is a set of parameters
rather than a single parameter and has a set of sufficient statistics. In particular,
the technique may be employed to test the null hypothesis of a two-way
contingency table using Eq. (36) to define $\lambda$. One merely uses the conditional
distribution of Eq. (45) and determines an interval $0 < \lambda < \lambda_0(n_{i.}, n_{.j})$ which has
the desired probability of a Type I error for the observed marginal totals.
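
For a small 2 x 2 table the conditional test can be carried out by direct enumeration: list every table with the observed marginal totals, compute $\lambda$ from Eq. (36) and the conditional probability from Eq. (45) for each, and add up the probabilities of the tables whose $\lambda$ does not exceed the observed value. The sketch below follows that recipe; the observed table is hypothetical, the function names are our own, and reporting a conditional tail probability (rather than fixing $\lambda_0$ in advance) is a presentational choice on our part.

```python
from fractions import Fraction
from math import factorial, log

def margins(table):
    """Row totals and column totals of a two-way table."""
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]

def log_lambda(table):
    """log of Eq. (36); a zero count contributes 0 because 0^0 is taken as 1."""
    rows, cols = margins(table)
    n = sum(rows)

    def xlogx(m):
        return m * log(m) if m > 0 else 0.0

    cells = [c for r in table for c in r]
    return (sum(xlogx(m) for m in rows) + sum(xlogx(m) for m in cols)
            - xlogx(n) - sum(xlogx(m) for m in cells))

def cond_prob(table):
    """Eq. (45) as an exact fraction."""
    rows, cols = margins(table)
    num = 1
    for t in rows + cols:
        num *= factorial(t)
    den = factorial(sum(rows))
    for r in table:
        for c in r:
            den *= factorial(c)
    return Fraction(num, den)

def conditional_p_value(observed):
    """P[Lambda <= lambda_observed | margins] for a 2 x 2 table."""
    (r1, r2), (c1, _) = margins(observed)
    cutoff = log_lambda(observed) + 1e-12   # small tolerance for floating-point ties
    p = Fraction(0)
    for n11 in range(max(0, c1 - r2), min(r1, c1) + 1):
        t = [[n11, r1 - n11], [c1 - n11, r2 - c1 + n11]]
        if log_lambda(t) <= cutoff:
            p += cond_prob(t)
    return p

observed = [[4, 1], [0, 7]]   # hypothetical table, not from the text
p_value = conditional_p_value(observed)
```

One would reject at level .05 when this conditional probability is at most .05, which corresponds to the observed $\lambda$ falling in the interval $0 < \lambda < \lambda_0(n_{i.}, n_{.j})$.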

In applications of this test one is confronted with a very tedious
computation in determining the distribution of $\Lambda$ unless r, s, and the marginal totals are
quite small. It can be shown, however, that the large-sample approximation
may be used without appreciable error except when both r and s equal 2. In
the latter instance, other simplifying approximations have been developed (see,
for example, Fisher and Yates, "Tables for Statisticians and Biometricians,"
Oliver & Boyd Ltd., Edinburgh or London, 1938), but we shall not explore the
problem that far.

Another test of the $\mathscr{H}_0$ given in Eq. (33) is obtained if the distribution in
Eq. (45) is replaced by its multivariate normal approximation since then it can
be shown that the statistic

$$Q = \sum_{i,j} \frac{[N_{ij} - n(N_{i.}/n)(N_{.j}/n)]^2}{n(N_{i.}/n)(N_{.j}/n)} \tag{47}$$

has approximately the chi-square distribution with $rs - 1 - (r - 1 + s - 1) = (r - 1)(s - 1)$
degrees of freedom. The test criterion is to reject for large Q.
This is the criterion first proposed (by Karl Pearson) for testing the hypothesis,
and it differs from $-2 \log \lambda$ by terms of order $1/\sqrt{n}$. The two criteria are
therefore essentially equivalent unless n is small. The argument that Q is a
reasonable test statistic is entirely analogous to that used in Subsec. 5.2 above to
justify Eq. (25). The statistic Q of Eq. (47) has intuitive appeal. $N_{ij}$ is the
observed number in the ijth cell, and $n(N_{i.}/n)(N_{.j}/n)$ is an estimator of the expected
number in the ijth cell when $\mathscr{H}_0$ is true. Thus, Q will tend to be small for $\mathscr{H}_0$
true and large for $\mathscr{H}_0$ false.
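
The two criteria just compared can be evaluated side by side on a small table. The following is a minimal sketch, not a definitive implementation: it assumes that $\lambda$ is the usual generalized likelihood ratio for independence, so that $-2 \log \lambda = 2 \sum n_{ij} \log(n_{ij}/e_{ij})$ with $e_{ij} = n_{i.} n_{.j}/n$; the function name and the table of counts are invented for illustration.

```python
import math

def two_way_statistics(table):
    """Return (Q, -2 log lambda, degrees of freedom) for an r x s table of counts."""
    r, s = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]                              # N_i.
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(s)]   # N_.j
    q = 0.0
    g = 0.0
    for i in range(r):
        for j in range(s):
            expected = row_tot[i] * col_tot[j] / n    # n (N_i./n)(N_.j/n)
            observed = table[i][j]
            q += (observed - expected) ** 2 / expected
            if observed > 0:                          # empty cells contribute 0
                g += 2.0 * observed * math.log(observed / expected)
    return q, g, (r - 1) * (s - 1)

# Hypothetical 2 x 3 table of counts, invented for illustration.
counts = [[20, 30, 50],
          [30, 30, 40]]
q, g, df = two_way_statistics(counts)
print(q, g, df)
```

For moderately large n the two values are close, in line with the remark that they differ by terms of order $1/\sqrt{n}$; either would be referred to the chi-square distribution with (r - 1)(s - 1) degrees of freedom.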

Three-way contingency tables If the elements of a population can be classified
according to three criteria A, B, and C with classifications $A_i$ ($i = 1, 2, \ldots, s_1$),
$B_j$ ($j = 1, 2, \ldots, s_2$), and $C_k$ ($k = 1, 2, \ldots, s_3$), a sample of n individuals may
be classified in a three-way $s_1 \times s_2 \times s_3$ contingency table. We shall let $p_{ijk}$
represent the probabilities associated with the individual cells and $N_{ijk}$ be the
numbers of sample elements in the individual cells, and, as before, marginal
totals will be indicated by replacing the summed index by a dot; thus

$$N_{ij.} = \sum_{k=1}^{s_3} N_{ijk} \quad\text{and}\quad N_{..k} = \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} N_{ijk}. \tag{48}$$



There are four hypotheses that may be tested in connection with this table.
We may test whether all three criteria are mutually independent, in which case
the null hypothesis is

$$p_{ijk} = p_{i..}\, p_{.j.}\, p_{..k}, \tag{49}$$

where $p_{i..} = \sum_j \sum_k p_{ijk}$, $p_{.j.} = \sum_i \sum_k p_{ijk}$, and $p_{..k} = \sum_i \sum_j p_{ijk}$; or we may test
whether any one of the three criteria is independent of the other two. Thus to
test whether the B classification is independent of A and C, we set up the null
hypothesis

$$p_{ijk} = p_{i.k}\, p_{.j.}, \tag{50}$$

where $p_{i.k} = \sum_j p_{ijk}$.

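The procedure for testing these hypotheses is entirely analogous to that
for the two-way tables. The likelihood of the sample is

$$L = \prod_{i,j,k} p_{ijk}^{\,n_{ijk}}, \tag{51}$$

where

$$\sum_{i,j,k} p_{ijk} = 1 \quad\text{and}\quad \sum_{i,j,k} n_{ijk} = n.$$

In $\Theta$ the maximum of L occurs when

$$\hat{p}_{ijk} = \frac{n_{ijk}}{n},$$

so that

$$\sup_{\Theta} L = \frac{1}{n^n} \prod_{i,j,k} n_{ijk}^{\,n_{ijk}}. \tag{52}$$

To test the null hypothesis in Eq. (50), for example, we make the substitution
of Eq. (50) into Eq. (51) and maximize L with respect to the $p_{i.k}$ and $p_{.j.}$ to find

$$\hat{p}_{i.k} = \frac{n_{i.k}}{n} \quad\text{and}\quad \hat{p}_{.j.} = \frac{n_{.j.}}{n},$$

and

$$\sup_{\Theta_0} L = \frac{1}{n^{2n}} \Bigl(\prod_{i,k} n_{i.k}^{\,n_{i.k}}\Bigr) \Bigl(\prod_{j} n_{.j.}^{\,n_{.j.}}\Bigr). \tag{53}$$
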
The generalized likelihood-ratio $\lambda$ is given by the quotient of Eqs. (53) and (52),
and in large samples $-2 \log \lambda$ has approximately the chi-square distribution with

$$s_1 s_2 s_3 - 1 - [(s_1 s_3 - 1) + s_2 - 1] = (s_1 s_3 - 1)(s_2 - 1)$$

degrees of freedom. Again the large-sample distribution is quite adequate for
many purposes. $(s_1 s_3 - 1) + (s_2 - 1)$ is the dimension of $\Theta_0$, and $s_1 s_2 s_3 - 1$ is
the dimension of $\Theta$.

A test statistic analogous to that given in Eq. (47) for testing independence
in a two-way contingency table can also be derived. For testing $\mathscr{H}_0$: A and C
classifications are independent of the B classification, such a test statistic is

$$Q = \sum_{i,j,k} \frac{[N_{ijk} - n(N_{i.k}/n)(N_{.j.}/n)]^2}{n(N_{i.k}/n)(N_{.j.}/n)}. \tag{54}$$

Under $\mathscr{H}_0$, Q has an asymptotic chi-square distribution with
$s_1 s_2 s_3 - 1 - (s_1 s_3 - 1) - (s_2 - 1) = (s_1 s_3 - 1)(s_2 - 1)$ degrees of freedom. Again, the
statistic Q of Eq. (54) has intuitive appeal since $N_{ijk}$ is the observed number in
cell ijk and $n(N_{i.k}/n)(N_{.j.}/n)$ is an estimator of the expected number when $\mathscr{H}_0$
is true.
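
As an illustration of the three-way statistic, the sketch below computes Q for the hypothesis that the B classification is independent of A and C, using a nested list `table[i][j][k]` for the counts $N_{ijk}$. The function name and both tables are hypothetical; the first table is built so that the hypothesis holds exactly, which makes every observed count equal its estimated expectation.

```python
def three_way_q(table):
    """Return (Q, degrees of freedom) for H0: p_ijk = p_i.k * p_.j. ."""
    s1, s2, s3 = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(s1) for j in range(s2) for k in range(s3)]
    n = sum(table[i][j][k] for i, j, k in cells)
    # Marginal totals N_i.k (summed over j) and N_.j. (summed over i and k).
    n_i_k = [[sum(table[i][j][k] for j in range(s2)) for k in range(s3)]
             for i in range(s1)]
    n_j = [sum(table[i][j][k] for i in range(s1) for k in range(s3))
           for j in range(s2)]
    q = 0.0
    for i, j, k in cells:
        expected = n_i_k[i][k] * n_j[j] / n       # n (N_i.k/n)(N_.j./n)
        q += (table[i][j][k] - expected) ** 2 / expected
    return q, (s1 * s3 - 1) * (s2 - 1)

a = [[2, 4], [6, 8]]   # hypothetical weights for the (i, k) pairs
b = [1, 3]             # hypothetical weights for the j classes
independent = [[[a[i][k] * b[j] for k in range(2)] for j in range(2)]
               for i in range(2)]
q0, df = three_way_q(independent)

# Disturbing one cell moves the table away from the hypothesis, so Q grows.
perturbed = [[row[:] for row in plane] for plane in independent]
perturbed[0][0][0] += 5
q1, _ = three_way_q(perturbed)
print(q0, q1, df)
```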

6 TESTS OF HYPOTHESES AND 
CONFIDENCE INTERVALS 

In Subsec. 3.4 above we noted that a confidence interval for a unidimensional
parameter $\theta$ could be used to obtain a test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$.
In this section we will further explore that concept and show that one can
reverse the operation; that is, one can use a family of tests of $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta \ne \theta_0$ (the family is generated by varying $\theta_0$) to obtain a confidence
interval for $\theta$. Our considerations in this section will not be very thorough; our
intent is merely to present an introduction to the usefulness of the close relationship
between hypothesis testing and confidence intervals.

Our discussion can be made somewhat more general if we speak in terms
of confidence sets rather than confidence intervals. As usual, let $\mathfrak{X}$ denote the
sample space, $\Theta$ the parameter space, and $(x_1, \ldots, x_n)$ the observed sample.

Definition 18 Confidence set A family of subsets of the parameter
space $\Theta$ indexed by $(x_1, \ldots, x_n) \in \mathfrak{X}$, denoted by $\bar{\mathcal{S}} = \{\mathcal{S}(x_1, \ldots, x_n)$:
$\mathcal{S}(x_1, \ldots, x_n) \subset \Theta$; $(x_1, \ldots, x_n) \in \mathfrak{X}\}$, is defined to be a family of confidence
sets with confidence coefficient $\gamma$ if and only if

$$P_\theta[\mathcal{S}(X_1, \ldots, X_n) \text{ contains } \theta] = \gamma \quad\text{for all } \theta \in \Theta. \tag{55}$$

////
It should be emphasized that any member, say $\mathcal{S}(x_1, \ldots, x_n)$, of the family
of confidence sets is a subset of $\Theta$, the parameter space. $\mathcal{S}(X_1, \ldots, X_n)$ is a
random subset; for any possible value, say $(x_1, \ldots, x_n)$, of $(X_1, \ldots, X_n)$,
$\mathcal{S}(X_1, \ldots, X_n)$ takes on the value $\mathcal{S}(x_1, \ldots, x_n)$, a member of the family $\bar{\mathcal{S}}$. To
aid in the interpretation of the probability statement in Eq. (55), note that for a
fixed (yet arbitrary) $\theta$, “$\mathcal{S}(X_1, \ldots, X_n)$ contains $\theta$” is an event [it is the event that
the random interval $\mathcal{S}(X_1, \ldots, X_n)$ contains the fixed $\theta$] and the $\theta$ that appears
as a subscript in $P_\theta$ is the $\theta$ that indexes the distribution of the $X_i$’s appearing
in $\mathcal{S}(X_1, \ldots, X_n)$.

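For instance, suppose $X_1, \ldots, X_n$ is a random sample from $N(\theta, 1)$.
$\Theta = \{\theta: -\infty < \theta < \infty\}$. Let the subset $\mathcal{S}(x_1, \ldots, x_n)$ be the interval
$(\bar{x} - z/\sqrt{n},\ \bar{x} + z/\sqrt{n})$, where z is given by $\Phi(z) - \Phi(-z) = \gamma$; then the family of
subsets $\bar{\mathcal{S}} = \{\mathcal{S}(x_1, \ldots, x_n): \mathcal{S}(x_1, \ldots, x_n) = (\bar{x} - z/\sqrt{n},\ \bar{x} + z/\sqrt{n})\}$ is a family
of confidence sets with a confidence coefficient $\gamma$ since

$$P_\theta[\mathcal{S}(X_1, \ldots, X_n) \text{ contains } \theta]
= P_\theta\Bigl[\bar{X} - \frac{z}{\sqrt{n}} < \theta < \bar{X} + \frac{z}{\sqrt{n}}\Bigr]
= P_\theta\Bigl[-z < \frac{\bar{X} - \theta}{1/\sqrt{n}} < z\Bigr] = \gamma$$

for all $\theta \in \Theta$.

The family $\bar{\mathcal{S}}$ is a family of confidence intervals for $\theta$ having a confidence
coefficient $\gamma$. In general, then, a confidence interval is an example of a
confidence set.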

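Confidence sets can be constructed from tests of hypotheses, as we now
show. Let $\Upsilon_{\theta_0}$ be a size-$\alpha$ test (nonrandomized) of the null hypothesis $\mathscr{H}_0$:
$\theta = \theta_0$, and let $\mathfrak{X}(\theta_0)$ be the acceptance region of the test $\Upsilon_{\theta_0}$. [The acceptance
region is the set complement of the critical region; that is, if the critical region is
given by $C(\theta_0)$, then $\mathfrak{X}(\theta_0) = \mathfrak{X} - C(\theta_0)$.] Note that $\mathfrak{X}(\theta_0)$ is a subset of $\mathfrak{X}$
indexed by $\theta_0$. Since the test $\Upsilon_{\theta_0}$ has size $\alpha$,

$$P_{\theta_0}[(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)] = 1 - \alpha.$$

If we now vary $\theta_0$ over $\Theta$ and for each $\theta_0$ we have a test $\Upsilon_{\theta_0}$, then we get a
family of acceptance regions, namely, $\{\mathfrak{X}(\theta_0): \theta_0 \in \Theta\}$. $\mathfrak{X}(\theta_0)$ is the acceptance
region of test $\Upsilon_{\theta_0}$. One can now define

$$\mathcal{S}(x_1, \ldots, x_n) = \{\theta_0: (x_1, \ldots, x_n) \in \mathfrak{X}(\theta_0)\}. \tag{56}$$

Clearly $\mathcal{S}(x_1, \ldots, x_n)$ is a subset of $\Theta$. Furthermore, the family $\{\mathcal{S}(x_1, \ldots, x_n)\}$
is a family of confidence sets with a confidence coefficient $\gamma = 1 - \alpha$ since one has
$\{\mathcal{S}(X_1, \ldots, X_n)$ contains $\theta_0\}$ if and only if one has $\{(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)\}$, and so

$$P_{\theta_0}[\mathcal{S}(X_1, \ldots, X_n) \text{ contains } \theta_0] = P_{\theta_0}[(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)] = 1 - \alpha.$$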
EXAMPLE 25 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, 1)$, and consider
testing $\mathscr{H}_0: \theta = \theta_0$. A test with size $\alpha$ is given by the following: Reject
$\mathscr{H}_0$ if and only if $|\bar{x} - \theta_0| \ge z/\sqrt{n}$, where z is defined by $\Phi(z) - \Phi(-z) = 1 - \alpha$.
The acceptance region of this test is given by

$$\mathfrak{X}(\theta_0) = \Bigl\{(x_1, \ldots, x_n): \theta_0 - \frac{z}{\sqrt{n}} < \bar{x} < \theta_0 + \frac{z}{\sqrt{n}}\Bigr\}.$$

We can now define, as in Eq. (56),

$$\mathcal{S}(x_1, \ldots, x_n) = \{\theta_0: (x_1, \ldots, x_n) \in \mathfrak{X}(\theta_0)\}
= \Bigl\{\theta_0: \theta_0 - \frac{z}{\sqrt{n}} < \bar{x} < \theta_0 + \frac{z}{\sqrt{n}}\Bigr\}
= \Bigl\{\theta_0: \bar{x} - \frac{z}{\sqrt{n}} < \theta_0 < \bar{x} + \frac{z}{\sqrt{n}}\Bigr\}.$$

$\mathcal{S}(x_1, \ldots, x_n)$ is a confidence set (in fact a confidence interval) with a
confidence coefficient $\gamma = 1 - \alpha$. ////
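
The inversion in Eq. (56) can be imitated numerically: run the size-$\alpha$ test for every $\theta_0$ on a grid and keep the values that are accepted. The sketch below does this for the normal example and compares the result with the closed-form interval. The data, the grid, and the function names are invented for illustration, and a finite grid can only approximate the set to within its spacing.

```python
import math
from statistics import NormalDist, fmean

def accepts(theta0, xs, alpha):
    """Acceptance rule of the size-alpha test of H0: theta = theta0 for N(theta, 1) data."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # Phi(z) - Phi(-z) = 1 - alpha
    return abs(fmean(xs) - theta0) < z / math.sqrt(len(xs))

def confidence_set_by_inversion(xs, alpha, grid):
    """Eq. (56) on a grid: keep each theta0 whose test accepts the sample."""
    return [t for t in grid if accepts(t, xs, alpha)]

def closed_form_interval(xs, alpha):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z / math.sqrt(len(xs))
    return fmean(xs) - half, fmean(xs) + half

# Hypothetical observations and grid, invented for illustration.
xs = [0.3, -1.2, 0.8, 1.5, 0.1, -0.4, 2.0, 0.9, -0.7]
grid = [i / 100 for i in range(-200, 301)]
kept = confidence_set_by_inversion(xs, 0.05, grid)
lo, hi = closed_form_interval(xs, 0.05)
print(min(kept), max(kept), lo, hi)
```

The accepted grid points fill out the interval $(\bar{x} - z/\sqrt{n},\ \bar{x} + z/\sqrt{n})$, which is the content of the example.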


The general procedure exhibited above shows how tests of hypotheses can
be used to generate or construct confidence sets. The procedure is reversible;
that is, a given family of confidence sets can be “reverted” to give a test of
hypothesis. Specifically, for a given family $\{\mathcal{S}(x_1, \ldots, x_n)\}$ of confidence sets
with a confidence coefficient $\gamma$, if we define

$$\mathfrak{X}(\theta_0) = \{(x_1, \ldots, x_n): \theta_0 \in \mathcal{S}(x_1, \ldots, x_n)\}, \tag{57}$$

then the nonrandomized test with acceptance region $\mathfrak{X}(\theta_0)$ is a test of $\mathscr{H}_0: \theta = \theta_0$
with size $\alpha = 1 - \gamma$.

The usefulness of the strong relationship between tests of hypotheses and 
confidence sets is exemplified not only in the fact that one can be used to construct 
the other but also in the result that often an optimal property of one carries over 
to the other. That is, if one can find a test that is optimal in some sense, then 
the corresponding constructed confidence set is also optimal in some sense, and 
conversely. We will not study the very interesting theoretical result alluded to 
in the previous sentence, but we will give the following in order to give some idea 
of the types of optimality that can be expected. (See the more advanced books 
of Ref. 16 and Ref. 19 for a detailed discussion.) An optimum property of 
confidence sets is given in the following definition. 
Definition 19 Uniformly most accurate A family $\{\mathcal{S}^*(x_1, \ldots, x_n)\}$ of
confidence sets with a confidence coefficient $\gamma$ is defined to be a uniformly
most accurate family of confidence sets at a confidence coefficient $\gamma$ if for
any other family $\{\mathcal{S}(x_1, \ldots, x_n)\}$ of confidence sets with a coefficient $\gamma$

$$P_\theta[\mathcal{S}^*(X_1, \ldots, X_n) \text{ contains } \theta'] \le P_\theta[\mathcal{S}(X_1, \ldots, X_n) \text{ contains } \theta']$$

for all $\theta$ and $\theta'$. ////

Definition 19 is saying that $\mathcal{S}^*(X_1, \ldots, X_n)$ is less likely to contain an
incorrect $\theta'$ than is $\mathcal{S}(X_1, \ldots, X_n)$, whereas both $\mathcal{S}^*(X_1, \ldots, X_n)$ and $\mathcal{S}(X_1, \ldots, X_n)$
have the same probability of containing the correct $\theta$. As you may have guessed,
uniformly most accurate confidence sets rarely exist. However, uniformly most
accurate confidence sets within restricted classes of confidence sets could also be
defined, and then one could be hopeful of the existence of such optimal confidence
sets. A general type of result that derives from the close relationship between
tests of hypotheses and confidence sets is the following: If $\Upsilon^*$ is a uniformly most
powerful size-$\alpha$ test of $\mathscr{H}_0: \theta = \theta_0$ within some restricted class of tests, then the
confidence set corresponding to $\Upsilon^*$ is uniformly most accurate with coefficient
$\gamma = 1 - \alpha$ within some restricted class of confidence sets. With such a result
one can see how an optimality of a test can be transferred to an optimality of a
corresponding confidence set, and therein lies the real utility of the close relationship
between hypothesis testing and confidence sets.

7 SEQUENTIAL TESTS OF HYPOTHESES 
7.1 Introduction 

Sequential analysis refers to techniques for testing hypotheses or estimating 
parameters when the sample size is not fixed in advance but is determined during 
the course of the experiment by criteria which depend on the observations as they 
occur. In this section we propose to consider, and then only briefly, one form 
of sequential analysis, namely, the sequential probability ratio test. 

In Sec. 2 above we considered testing the simple null hypothesis $\mathscr{H}_0: \theta = \theta_0$
versus the simple alternative hypothesis $\mathscr{H}_1: \theta = \theta_1$. It was shown (Neyman-Pearson
lemma) that for samples of fixed size n, the test which minimized the
size, say $\beta$, of the Type II error for fixed size, say $\alpha$, of the Type I error was a
simple likelihood-ratio test. That is, for fixed n and $\alpha$, $\beta$ was minimized.
Suppose now that it is desired to fix both $\alpha$ and $\beta$ in advance and then find that
simple likelihood-ratio test having minimum sample size n and having size of
Type I error equal to $\alpha$ and size of Type II error $\beta$. The solution of such a
problem is illustrated in the following example.
EXAMPLE 26 A manufacturer of a certain component, say, an oil seal, knows
the history of his current manufacturing process. He knows, for instance,
that the distribution of lifetimes of the seals now being manufactured is,
say, N(100, 100). A new manufacturing process is suggested; the manufacturer
wants to continue with his present manufacturing process if the new
process is not better (longer mean lifetime), yet he also wants to be quite
certain to switch to the new process if the new process increases the mean
lifetime by, say, 5 percent. He proposes to take a sample of observations
of lifetimes of seals made by the new process and then from the sample
decide whether or not the process has longer mean life. He models the
experiment by assuming that the random variable X, representing the lifetime
of a seal manufactured using the new process, is distributed as
$N(\theta, 100)$, and he wants to test $\mathscr{H}_0: \theta \le 100$ versus $\mathscr{H}_1: \theta > 100$. He
fixes his error sizes and wants to determine the sample size n so that, say,

$$.01 = \alpha = P_{\theta=100}[\text{reject } \mathscr{H}_0] \quad\text{and}\quad .05 = \beta = P_{\theta=105}[\text{accept } \mathscr{H}_0].$$

That is, he seeks to determine n so that there is only a 1 percent chance of
rejecting the hypothesis that the new process is no better than the old when in
fact it is no better, yet there is a 95 percent chance of rejecting the hypothesis
that the mean lifetime of the new process is no more than 100 when in fact it is
5 percent larger. It can be shown
that the simple likelihood-ratio test is equivalent to the test of rejecting
$\mathscr{H}_0$ for large $\bar{X}_n$. Thus he seeks to determine n and k so that

$$.01 = P_{\theta=100}[\bar{X}_n > k] \quad\text{and}\quad .05 = P_{\theta=105}[\bar{X}_n \le k],$$

or

$$.01 = 1 - \Phi\Bigl(\frac{k - 100}{10/\sqrt{n}}\Bigr) \quad\text{and}\quad .05 = \Phi\Bigl(\frac{k - 105}{10/\sqrt{n}}\Bigr).$$

The first equation implies

$$\frac{k - 100}{10/\sqrt{n}} \approx 2.326,$$

and the second implies

$$\frac{k - 105}{10/\sqrt{n}} \approx -1.645;$$

which together imply that $100 + 10(2.326)/\sqrt{n} \approx 105 - 10(1.645)/\sqrt{n}$, or
$n \approx 63.08$; so a sample of size 64 is needed. ////
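
The arithmetic of Example 26 can be reproduced with the standard normal quantile function. This is a minimal sketch; the function name and argument order are choices made here for illustration, and it assumes, as in the example, a normal population with known standard deviation.

```python
import math
from statistics import NormalDist

def fixed_sample_size(theta0, theta1, sigma, alpha, beta):
    """Solve the two equations of Example 26 for n and k.

    Returns (n as a real number, n rounded up, cutoff k for the real-valued n).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)    # about 2.326 for alpha = .01
    z_beta = NormalDist().inv_cdf(1 - beta)      # about 1.645 for beta = .05
    n_exact = (sigma * (z_alpha + z_beta) / (theta1 - theta0)) ** 2
    k = theta0 + sigma * z_alpha / math.sqrt(n_exact)
    return n_exact, math.ceil(n_exact), k

n_exact, n, k = fixed_sample_size(100, 105, 10, 0.01, 0.05)
print(n_exact, n, k)
```

With the values of the example, the real-valued solution is about 63.08 and rounds up to the 64 observations quoted in the text.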


Referring to the above example, the following considerations make sequential
analysis interesting both from the theoretical and practical viewpoint.
In drawing the 64 observations to test $\mathscr{H}_0$, it is possible that among the first
few observations, say 20, 30, or 40, the evidence is quite sufficient relative to $\alpha$
and $\beta$ for accepting or rejecting $\mathscr{H}_0$, and then taking additional observations
would be a waste of time and effort. In other words, the possibility is
raised that, by constructing the test in a fashion which permits termination of
the sampling at any observation, one can test $\mathscr{H}_0$ with fixed error sizes $\alpha$ and $\beta$
and yet do so with fewer than 64 observations on the average. This is in fact
the case, although it may at first appear surprising in view of the fact that the
best test for fixed sample size requires 64 observations. The saving in observations
is often quite large, sometimes as much as 50 percent! We will study such
a sequential procedure in the remaining subsections.


7.2 Definition of Sequential Probability Ratio Test 

Consider testing a simple null hypothesis against a simple alternative hypothesis.
In other words, suppose a sample can be drawn from one of two distributions
(it is not known which one) and it is desired to test that the sample came from
one distribution against the possibility that it came from the other. If $X_1$,
$X_2, \ldots$ denotes the random variables, we want to test $\mathscr{H}_0: X_i \sim f_0(\cdot)$ versus
$\mathscr{H}_1: X_i \sim f_1(\cdot)$. The simple likelihood-ratio test was of the following form:

Reject $\mathscr{H}_0$ if $\lambda = L_0/L_1 \le k$ for some constant $k > 0$.

The sequential test that we propose to consider employs the likelihood-ratios
sequentially. Define

$$\lambda_m = \lambda_m(x_1, \ldots, x_m) = \frac{L_0(x_1, \ldots, x_m)}{L_1(x_1, \ldots, x_m)} = \frac{L_0(m)}{L_1(m)}
= \frac{\prod_{i=1}^{m} f_0(x_i)}{\prod_{i=1}^{m} f_1(x_i)}$$

for $m = 1, 2, \ldots$, and compute sequentially $\lambda_1, \lambda_2, \ldots$. For fixed $k_0$ and $k_1$
satisfying $0 < k_0 < k_1$, adopt the following procedure: Take observation $x_1$ and
compute $\lambda_1$; if $\lambda_1 \le k_0$, reject $\mathscr{H}_0$; if $\lambda_1 \ge k_1$, accept $\mathscr{H}_0$; and if $k_0 < \lambda_1 < k_1$,
take observation $x_2$, and compute $\lambda_2$. If $\lambda_2 \le k_0$, reject $\mathscr{H}_0$; if $\lambda_2 \ge k_1$,
accept $\mathscr{H}_0$; and if $k_0 < \lambda_2 < k_1$, observe $x_3$, etc. The idea is to continue
sampling as long as $k_0 < \lambda_j < k_1$ and stop as soon as $\lambda_m \le k_0$ or $\lambda_m \ge k_1$,
rejecting $\mathscr{H}_0$ if $\lambda_m \le k_0$ and accepting $\mathscr{H}_0$ if $\lambda_m \ge k_1$. The critical region of
the described sequential test can be defined as $C = \bigcup_{n=1}^{\infty} C_n$, where

$$C_n = \{(x_1, \ldots, x_n): k_0 < \lambda_j(x_1, \ldots, x_j) < k_1,\ j = 1, \ldots, n - 1,\ \lambda_n(x_1, \ldots, x_n) \le k_0\}. \tag{58}$$

A point in $C_n$ indicates that $\mathscr{H}_0$ is to be rejected for a sample of size n. Similarly,
the acceptance region can be defined as $A = \bigcup_{n=1}^{\infty} A_n$, where

$$A_n = \{(x_1, \ldots, x_n): k_0 < \lambda_j(x_1, \ldots, x_j) < k_1,\ j = 1, \ldots, n - 1,\ \lambda_n(x_1, \ldots, x_n) \ge k_1\}. \tag{59}$$

Definition 20 Sequential probability ratio test For fixed $0 < k_0 < k_1$, a test
as described above is defined to be a sequential probability ratio test. ////
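
The stopping rule in the definition translates directly into a loop. The sketch below is one possible rendering under stated assumptions: it works with $\log \lambda_m$ rather than $\lambda_m$ to avoid numerical underflow of the products, and the function name, the constants $k_0 = .05$ and $k_1 = 20$, and the data are all invented for illustration.

```python
import math
from statistics import NormalDist

def sprt(observations, f0, f1, k0, k1):
    """Sequential probability ratio test for densities f0 (null) and f1 (alternative).

    Returns (decision, m): the decision reached and the number of observations
    used. The decision is "continue" if the data run out before a boundary
    is crossed.
    """
    log_lam = 0.0
    m = 0
    for m, x in enumerate(observations, start=1):
        log_lam += math.log(f0(x)) - math.log(f1(x))   # log lambda_m
        if log_lam <= math.log(k0):
            return "reject H0", m
        if log_lam >= math.log(k1):
            return "accept H0", m
    return "continue", m

# Illustration with the two normal densities of Example 26.
f0 = NormalDist(100, 10).pdf
f1 = NormalDist(105, 10).pdf
high = sprt([110] * 20, f0, f1, 0.05, 20)      # readings well above 105
low = sprt([95] * 20, f0, f1, 0.05, 20)        # readings below 100
short = sprt([102.5] * 3, f0, f1, 0.05, 20)    # uninformative readings
print(high, low, short)
```

Each reading of 110 lowers $\log \lambda_m$ by .375 and each reading of 95 raises it by the same amount, so with these constants a boundary is first reached at the eighth observation in both cases.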

When we considered the simple likelihood-ratio test for fixed sample size
n, we determined k so that the test would have preassigned size $\alpha$. We now
want to determine $k_0$ and $k_1$ so that the sequential probability ratio test will have
preassigned $\alpha$ and $\beta$ for its respective sizes of the Type I and Type II errors.
Note that

$$\alpha = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{C_n} L_0(n) \tag{60}$$

and

$$\beta = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}] = \sum_{n=1}^{\infty} \int_{A_n} L_1(n), \tag{61}$$

where, as before, $\int_{C_n} L_0(n)$ is a shortened notation for $\int \cdots \int_{C_n} \bigl[\prod_{i=1}^{n} f_0(x_i)\, dx_i\bigr]$.
For fixed $\alpha$ and $\beta$, Eqs. (60) and (61) are two equations in the two unknowns $k_0$
and $k_1$. (Both $A_n$ and $C_n$ are defined in terms of $k_0$ and $k_1$.) A solution of
these two equations would give the sequential probability ratio test having the
desired preassigned error sizes $\alpha$ and $\beta$. As might be anticipated, the actual
determination of $k_0$ and $k_1$ from Eqs. (60) and (61) can be a major computational
project. In practice, they are seldom determined that way because a very simple
and accurate approximation is available and is given in the next subsection.

We note that the sample size of a sequential probability ratio test is a
random variable. The procedure says to continue sampling until $\lambda_n =
\lambda_n(x_1, \ldots, x_n)$ first falls outside the interval $(k_0, k_1)$. The actual sample size then
depends on which $x_i$’s are observed; it is a function of the random variables
$X_1, X_2, \ldots$ and consequently is itself a random variable. Denote it by N.
Ideally we would like to know the distribution of N or at least the expectation
of N. (The procedure, as defined, seemingly allows for the sampling to continue
indefinitely, meaning that N could be infinite. Although we will not so prove,
it can be shown that N is finite with probability 1.) One way of assessing the
performance of the sequential probability ratio test would be to evaluate the
expected sample size that is required under each hypothesis. The following
theorem, given without proof (see Lehmann [16]), states that the sequential
probability ratio test is an optimal test if performance is measured using expected
sample size.

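Theorem 10 The sequential probability ratio test with error sizes $\alpha$ and
$\beta$ minimizes both $\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}]$ and $\mathscr{E}[N \mid \mathscr{H}_1 \text{ is true}]$ among all tests
(sequential or not) which satisfy the following: $P[\mathscr{H}_0 \text{ is rejected} \mid \mathscr{H}_0 \text{ is true}] \le \alpha$,
$P[\mathscr{H}_0 \text{ is accepted} \mid \mathscr{H}_0 \text{ is false}] \le \beta$, and the expected sample
size is finite. ////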

Note that in particular the sequential probability ratio test requires fewer
observations on the average than does the fixed-sample-size test that has the
same error sizes. In Subsec. 7.4 we will evaluate the expected sample size for
the example given in the introduction in which 64 observations were required
for a fixed-sample-size test with preassigned $\alpha$ and $\beta$.


7.3 Approximate Sequential Probability Ratio Test 

We noted above that the determination of $k_0$ and $k_1$ that defines that particular
sequential probability ratio test which has error sizes $\alpha$ and $\beta$ is in general
computationally quite difficult. The following remark gives an approximation
to $k_0$ and $k_1$.

Remark Let $k_0$ and $k_1$ be defined so that the sequential probability
ratio test corresponding to $k_0$ and $k_1$ has error sizes $\alpha$ and $\beta$; then $k_0$ and
$k_1$ can be approximated by, say, $k_0'$ and $k_1'$, where

$$k_0' = \frac{\alpha}{1 - \beta} \quad\text{and}\quad k_1' = \frac{1 - \alpha}{\beta}. \tag{62}$$

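PROOF (Assume $\sum_{n=1}^{\infty} P[N = n \mid \mathscr{H}_i] = 1$ for $i = 0, 1$.)

$$\alpha = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{C_n} L_0(n) \le \sum_{n=1}^{\infty} \int_{C_n} k_0 L_1(n)$$
$$= k_0 \sum_{n=1}^{\infty} \int_{C_n} L_1(n) = k_0 P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_1 \text{ is true}] = k_0 (1 - \beta),$$

and hence $k_0 \ge \alpha/(1 - \beta)$. Also

$$1 - \alpha = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{A_n} L_0(n) \ge \sum_{n=1}^{\infty} \int_{A_n} k_1 L_1(n)$$
$$= k_1 P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_1 \text{ is true}] = k_1 \beta,$$

and hence $k_1 \le (1 - \alpha)/\beta$. Note that the approximations $k_0' = \alpha/(1 - \beta)$
and $k_1' = (1 - \alpha)/\beta$ satisfy

$$\frac{\alpha}{1 - \beta} = k_0' \le k_0 < k_1 \le k_1' = \frac{1 - \alpha}{\beta}. \tag{63}$$

////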

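Remark Let $\alpha'$ and $\beta'$ be the error sizes of the sequential probability
ratio test defined by $k_0'$ and $k_1'$ given in Eq. (62). Then $\alpha' + \beta' \le \alpha + \beta$.

PROOF Let $A'$ and $C'$ (with corresponding $A_n'$ and $C_n'$) denote the
acceptance and critical regions of the sequential probability ratio test
defined by $k_0'$ and $k_1'$. Then

$$\alpha' = \sum_{n=1}^{\infty} \int_{C_n'} L_0(n) \le \frac{\alpha}{1 - \beta} \sum_{n=1}^{\infty} \int_{C_n'} L_1(n) = \frac{\alpha}{1 - \beta} (1 - \beta'),$$

and

$$1 - \alpha' = \sum_{n=1}^{\infty} \int_{A_n'} L_0(n) \ge \frac{1 - \alpha}{\beta} \sum_{n=1}^{\infty} \int_{A_n'} L_1(n) = \frac{1 - \alpha}{\beta} \beta';$$

hence $\alpha'(1 - \beta) \le \alpha(1 - \beta')$, and $(1 - \alpha)\beta' \le (1 - \alpha')\beta$, which together
imply that $\alpha'(1 - \beta) + (1 - \alpha)\beta' \le \alpha(1 - \beta') + (1 - \alpha')\beta$, or $\alpha' + \beta' \le \alpha + \beta$.

////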





Naturally, one would prefer to use that sequential probability ratio test
having the desired preassigned error sizes $\alpha$ and $\beta$; however, since it is difficult
to find the $k_0$ and $k_1$ corresponding to such a sequential probability ratio test,
instead one can use that sequential probability ratio test defined by $k_0'$ and $k_1'$ of
Eq. (62) and be assured that the sum of the error sizes $\alpha'$ and $\beta'$ is less than or
equal to the sum of the desired error sizes $\alpha$ and $\beta$.
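
The assurance just stated can be examined by simulation. The sketch below estimates $\alpha'$ and $\beta'$ by Monte Carlo for a test between two normal means with common known variance, using the boundaries $k_0' = \alpha/(1 - \beta)$ and $k_1' = (1 - \alpha)/\beta$. The setting (the numbers of Example 26), the number of runs, the seed, and the function name are assumptions made for illustration, and the estimates carry sampling error.

```python
import math
import random

def simulate_sprt_errors(theta0, theta1, sigma, alpha, beta, runs, seed):
    """Monte Carlo estimates (alpha', beta') for the test using k0' and k1'."""
    rng = random.Random(seed)
    log_k0 = math.log(alpha / (1 - beta))       # log k0'
    log_k1 = math.log((1 - alpha) / beta)       # log k1'

    def rejects(theta):
        """Run one sequential test on N(theta, sigma^2) data; True if H0 is rejected."""
        total = 0.0
        while log_k0 < total < log_k1:
            x = rng.gauss(theta, sigma)
            # log[f0(x)/f1(x)] for two normal densities with common variance
            total += ((x - theta1) ** 2 - (x - theta0) ** 2) / (2 * sigma ** 2)
        return total <= log_k0

    alpha_hat = sum(rejects(theta0) for _ in range(runs)) / runs
    beta_hat = sum(not rejects(theta1) for _ in range(runs)) / runs
    return alpha_hat, beta_hat

alpha_hat, beta_hat = simulate_sprt_errors(100, 105, 10, 0.01, 0.05,
                                           runs=5000, seed=1)
print(alpha_hat, beta_hat, alpha_hat + beta_hat)
```

Up to simulation error, the estimated sum should not exceed $\alpha + \beta = .06$; because the sum of log ratios usually oversteps a boundary when it stops, the realized error sizes tend to fall somewhat below the nominal ones.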

7.4 Approximate Expected Sample Size of 
Sequential Probability Ratio Test 

The procedure used in performing a sequential probability ratio test is to continue
sampling as long as $k_0 < \lambda_m < k_1$ and stop sampling as soon as $\lambda_m \le k_0$ or
$\lambda_m \ge k_1$. If $z_i = \log_e [f_0(x_i)/f_1(x_i)]$, an equivalent test is given by the following:
Continue sampling as long as $\log_e k_0 < \sum_{i=1}^{m} z_i < \log_e k_1$, and stop sampling as soon
as $\sum_{i=1}^{m} z_i \le \log_e k_0$ (and then reject $\mathscr{H}_0$) or $\sum_{i=1}^{m} z_i \ge \log_e k_1$ (and then accept $\mathscr{H}_0$).
As before, let N be the random variable denoting the sample size of the sequential
probability ratio test, and let $Z_i = \log_e [f_0(X_i)/f_1(X_i)]$. Equation (64), given in
the following theorem, is useful in finding an approximate expected sample size
of the sequential probability ratio test.

Theorem 11 Wald’s equation Let $Z_1, Z_2, \ldots, Z_n, \ldots$ be independent
identically distributed random variables satisfying $\mathscr{E}[|Z_1|] < \infty$. Let N
be an integer-valued random variable whose value n depends only on the
values of the first n $Z_i$’s. Suppose $\mathscr{E}[N] < \infty$. Then

$$\mathscr{E}[Z_1 + \cdots + Z_N] = \mathscr{E}[N] \cdot \mathscr{E}[Z_1]. \tag{64}$$

PROOF

$$\mathscr{E}[Z_1 + \cdots + Z_N] = \mathscr{E}[\mathscr{E}[Z_1 + \cdots + Z_N \mid N]]$$
$$= \sum_{n=1}^{\infty} \mathscr{E}[Z_1 + \cdots + Z_n \mid N = n] P[N = n]$$
$$= \sum_{n=1}^{\infty} \sum_{i=1}^{n} \mathscr{E}[Z_i \mid N = n] P[N = n]$$
$$= \sum_{i=1}^{\infty} \sum_{n=i}^{\infty} \mathscr{E}[Z_i \mid N = n] P[N = n]$$
$$= \sum_{i=1}^{\infty} \mathscr{E}[Z_i \mid N \ge i] P[N \ge i]$$
$$= \sum_{i=1}^{\infty} \mathscr{E}[Z_i] P[N \ge i]$$
$$= \mathscr{E}[Z_1] \sum_{i=1}^{\infty} P[N \ge i]$$
$$= \mathscr{E}[Z_1] \mathscr{E}[N].$$

{$\mathscr{E}[Z_i] = \mathscr{E}[Z_i \mid N \ge i]$ since the event $\{N \ge i\}$ depends only on $Z_1, \ldots, Z_{i-1}$
and hence is independent of $Z_i$. Also $\mathscr{E}[N] = \sum_{i=1}^{\infty} P[N \ge i]$ follows from
Eq. (6) of Chap. II.} ////


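If the sequential probability ratio test leads to rejection of $\mathscr{H}_0$, then the
random variable $Z_1 + \cdots + Z_N \le \log_e k_0$, but $Z_1 + \cdots + Z_N$ is close to $\log_e k_0$
since $Z_1 + \cdots + Z_N$ first became less than or equal to $\log_e k_0$ at the Nth observation;
hence $\mathscr{E}[Z_1 + \cdots + Z_N] \approx \log_e k_0$. Similarly, if the test leads to acceptance,
$\mathscr{E}[Z_1 + \cdots + Z_N] \approx \log_e k_1$; hence $\mathscr{E}[Z_1 + \cdots + Z_N] \approx \rho \log_e k_0 + (1 - \rho) \log_e k_1$,
where $\rho = P[\mathscr{H}_0 \text{ is rejected}]$. Using

$$\mathscr{E}[N] = \frac{\mathscr{E}[Z_1 + \cdots + Z_N]}{\mathscr{E}[Z_1]} \approx \frac{\rho \log_e k_0 + (1 - \rho) \log_e k_1}{\mathscr{E}[Z_1]},$$

we obtain

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}] \approx \frac{\alpha \log_e k_0 + (1 - \alpha) \log_e k_1}{\mathscr{E}[Z_1 \mid \mathscr{H}_0 \text{ is true}]}
\approx \frac{\alpha \log_e [\alpha/(1 - \beta)] + (1 - \alpha) \log_e [(1 - \alpha)/\beta]}{\mathscr{E}[Z_1 \mid \mathscr{H}_0 \text{ is true}]} \tag{65}$$

and

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is false}] \approx \frac{(1 - \beta) \log_e k_0 + \beta \log_e k_1}{\mathscr{E}[Z_1 \mid \mathscr{H}_0 \text{ is false}]}
\approx \frac{(1 - \beta) \log_e [\alpha/(1 - \beta)] + \beta \log_e [(1 - \alpha)/\beta]}{\mathscr{E}[Z_1 \mid \mathscr{H}_0 \text{ is false}]}. \tag{66}$$
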

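EXAMPLE 27 Consider sampling from $N(\theta, \sigma^2)$, where $\sigma^2$ is assumed known.
Test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. Now

$$z_i = \log_e \frac{f_0(x_i)}{f_1(x_i)}
= \log_e \frac{(1/\sqrt{2\pi}\,\sigma)\, e^{-(x_i - \theta_0)^2/2\sigma^2}}{(1/\sqrt{2\pi}\,\sigma)\, e^{-(x_i - \theta_1)^2/2\sigma^2}}
= -\frac{1}{2\sigma^2} [(x_i - \theta_0)^2 - (x_i - \theta_1)^2]$$
$$= -\frac{1}{2\sigma^2} [(\theta_0^2 - \theta_1^2) - 2x_i(\theta_0 - \theta_1)];$$

hence

$$\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is true}] = -\frac{1}{2\sigma^2} [(\theta_0^2 - \theta_1^2) - 2\theta_0(\theta_0 - \theta_1)]
= \frac{1}{2\sigma^2} (\theta_1 - \theta_0)^2,$$

and

$$\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is false}] = -\frac{1}{2\sigma^2} (\theta_1 - \theta_0)^2.$$

For $\alpha = .01$, $\beta = .05$, $\sigma^2 = 100$, $\theta_0 = 100$, and $\theta_1 = 105$ (as in Example 26),
Eq. (65) reduces to

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}] \approx \frac{.01 \log_e (.01/.95) + .99 \log_e (.99/.05)}{25/200} \approx 24$$

and

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is false}] \approx 34.$$

The average sample sizes of 24 and 34 for the sequential probability ratio
test compare to a sample size of 64 for the fixed-sample-size test. ////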
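
The two approximate expected sample sizes of Example 27 follow from Eqs. (65) and (66) by direct substitution. The sketch below carries out that substitution for the normal case; the function name is hypothetical, and the formulas inherit the approximations made in the text (boundaries from Eq. (62), no allowance for overstepping a boundary).

```python
import math

def expected_sample_sizes(theta0, theta1, sigma2, alpha, beta):
    """Approximate E[N | H0 true] and E[N | H0 false] from Eqs. (65) and (66)."""
    log_k0 = math.log(alpha / (1 - beta))
    log_k1 = math.log((1 - alpha) / beta)
    ez_h0 = (theta1 - theta0) ** 2 / (2 * sigma2)    # E[Z | H0 true]
    ez_h1 = -ez_h0                                   # E[Z | H0 false]
    en_h0 = (alpha * log_k0 + (1 - alpha) * log_k1) / ez_h0
    en_h1 = ((1 - beta) * log_k0 + beta * log_k1) / ez_h1
    return en_h0, en_h1

en_h0, en_h1 = expected_sample_sizes(100, 105, 100, 0.01, 0.05)
print(en_h0, en_h1)
```

The computed values are about 23.3 and 33.4, which the example reports rounded up to 24 and 34.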


PROBLEMS 

1 Let X have a Bernoulli distribution, where $P[X = 1] = \theta = 1 - P[X = 0]$.

(a) For a random sample of size n = 10, test $\mathscr{H}_0: \theta \le 1/2$ versus $\mathscr{H}_1: \theta > 1/2$. Use
the critical region $\{\sum x_i \ge 6\}$.

(i) Find the power function, and sketch it.

(ii) What is the size of this test?

(b) For a random sample of size n = 10:

(i) Find the most powerful size-a (a = .0547) test of : 0 = i versus 

(ii) Find the power of the most powerful test at 8 = J. 

(c) For a random sample of size 10, test 6=1 versus 3 ^’ 1 : 6 = J. 

(i) Find the minimax test for the loss function 0= {(do; 6 0 ) = t(di ; 6 t ), 
t(d«\ 0,) = 1719, <f(d,; 6 0 ) = 2241. 

(ii) Compare the maximum risk of the minimax test with the maximum risk 
of the most powerful test given in part (6). 

(d) Again, for a sample of size 10, test Xf o'. 6 = l versus XPi : 6=\. Use the 
above loss function to find the Bayes test corresponding to prior probabilities 
given by 

3* 

^ = (1719/2241)(f) 10 + 3 4 ' 
2 Let X have the density $f(x; \theta) = \theta x^{\theta - 1} I_{(0,1)}(x)$.

(a) To test : 6 < 1 versus Jf t : 6 > 1, a sample of size 2 was selected, and the 
critical region C = {(xi, x 2 ): 3/4xi < x 2 } was used. Find the power function 
and size of this test. 

(b) For a random sample of size 2, find the most powerful size-fa = i(l — In 2)] 
test of Xf 0 '- 6=1 veiv.'s Xf ,: 6=2. 

(c) Are the tests that you obtained in parts (a) and (b) unbiased ? 

id) For a random sample of size 2, find the minimax test of Xfo : 6 0 = 1 versus 
Xf8, = 2 using the loss fimction tid c ; 6 C ) = fid ,; 6,) = 0, fid 0 ; 6,) = 
1 - log e 2, fid ,; 6 0 ) = i + log e 2. 

(e) For a random sample of size n, find the Bayes test corresponding to prior 
probabilities given by g — I of Xf o : 6=1 versus Xfi : 6=2 using the loss 
function t{d 0 \ 6 0 ) = fid t ; 6 t ) = 0, i ((d 0 \ 6,) = 1, £(d ,; 6 0 ) — 2. 

(/) Test : 6 = 1 versus Jf,: 6=2 using a sample of size 2. Let a = size of 
Type I error and f = size of Type II error. Find the test that minimizes the 
largest of a and f. 

3 Let 0 = {1, 2}, and suppose you have one observation from the density 

/ (e _*. e+*)(*). Show that a test that has uniformly smallest risk among all tests 

exists, and find it. 

4 Let X be a single observation from the density 

f(x;6)=6 x °-'I i0 . nto, 

where 6 > 0. 

(a) In testing : 6 < 1 versus Xfi : 6 > 1, find the power function and size of 
the test given by the following: Reject Xf a if and only \f X> f 

(b) Find a most powerful size-a test of Jf 0 : 6=2 versus Xf i: 6=1. 

(c) For the loss function given by f(d 0 ; 2) = l{d t ; 1) = 0, i ?(d 0 ; 1) = tidi ; 2) = 1, 
find the minimax test of Xf 0 - 6=2 versus Xf6=1. 

(d) Is there a uniformly most powerful size-a test of Xf 0 : 6 > 2 versus Xf\: 6 <21 
If so, what is it? 

(e) Among all possible simple likelihood-ratio tests of Xf o'- 6=2 versus Xf ^: 
6=1, find that test that minimizes a + /?, where a and fi are the respective 
sizes of the Type I and Type II errors. 

if) Find the generalized likelihood-ratio test of size a of Xf o' 6=1 versus 
Xfs.O* 1. 

5 Let X be a single observation from the density fix; 6) = ( 26x + 1 — 6)I l0 , tfx), 

where —1 <6<1. 

(«) Find the most powerful size-a test of Xfo : 6=0 versus Xf,: 6=1. (Your 
test should be expressed in terms of a.) 

ib) To test Xfo'- 6 <0 versus Xf ,: 6 >0, the following procedure was used: 
Reject Xfo if X exceeds i. Find the power and size of this test. 

(c) Is there a uniformly most powerful size-a test of Xf 0 : # < 0 versus Xf t :6 >0? 
If so, what is it? 

{d) What is the generalized likelihood-ratio test of 6=0 versus 3#* t : 8 ^ 0? 

(e) Among all possible simple likelihood-ratio tests of 3^ o'- 8=0 versus Jd’ 1 : 
8=1 find that test which minimizes a + /3, where a and are the respective 
sizes of the Type I and Type II errors, 

(/) Given a set of observations, all of which fall between 0 and 1, indicate how 
you would test the hypothesis that the ob c ovations came from the density 
fix; 8). 

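6 Let $X_1, \ldots, X_n$ denote a random sample from $f(x; \theta) = (1/\theta) I_{(0,\theta)}(x)$, and let
$Y_1, \ldots, Y_n$ be the corresponding ordered sample. To test $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta \ne \theta_0$, the following test was used: Accept $\mathscr{H}_0$ if $\theta_0 \sqrt[n]{\alpha} < Y_n \le \theta_0$;
otherwise reject.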

(a) Find the power function for this test, and sketch it. 

(b) Find another (nonrandomized) test that has the same size as the given test, 
and show that the given test is more powerful (for all alternative 6) than the 
test you found. 

7 Let $X_1, \ldots, X_n$ denote a random sample from

$$f(x; \theta) = (1/\theta)\, x^{(1 - \theta)/\theta} I_{(0,1)}(x).$$

Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

(a) For a sample of size n, find a uniformly most powerful (UMP) size-a test if 
such exists. 

(b) Take n = 2, 8 0 = 1, and a = .05, and sketch the power function of the UMP 
test. 

8 Let $X_1, \ldots, X_n$ be a random sample from the Poisson distribution

$$f(x; \theta) = \frac{e^{-\theta} \theta^x}{x!} I_{\{0, 1, 2, \ldots\}}(x).$$

(a) Find the UMP test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$, and sketch the power
function for $\theta_0 = 1$ and n = 25. (Use the central-limit theorem. Pick
$\alpha = .05$.)

(b) Test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$. Find the general form of the critical
region corresponding to the test arrived at using the generalized likelihood-ratio
principle. (The critical region should be defined in terms of $\sum X_i$.)

(c) A reasonable test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$ would be the following:
Reject if $|\bar{X} - \theta_0| > K$. For $\alpha = .05$, find K so that $P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0] = .05$.
(Assume that n is large enough so that the central-limit theorem can be used
to find an approximation to K.)

9 Let Q = {8 0 , 8i}. Show that any test arrived at using the generalized likelihood- 
ratio principle is equivalent to a simple likelihood-ratio test. 

10 To test $\mathscr{H}_0: \theta \le 1$ versus $\mathscr{H}_1: \theta > 1$ on the basis of two observations, say $X_1$ and
$X_2$, from the uniform distribution on $(0, \theta)$, the following test was used:

Reject $\mathscr{H}_0$ if $X_1 + X_2 > 1$.

(a) Find the power function of the above test, and note its size. [Recall that
$X_1 + X_2$ has a triangular distribution on $(0, 2\theta)$.]

(b) Find another test that has the same size as the given test but has greater power 
for some 8 > 1 if such exists. If such does not exist, explain why. 

11 Let $X_1, \ldots, X_n$ be a random sample of size n from $f(x; \theta) = \theta^2 x e^{-\theta x} I_{(0,\infty)}(x)$.

(a) In testing $\mathscr{H}_0: \theta \le 1$ versus $\mathscr{H}_1: \theta > 1$ for n = 1 (a sample of size 1) the
following test was used: Reject $\mathscr{H}_0$ if and only if $X_1 < 1$. Find the power
function and size of this test.

(b) Find a most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta = 1$ versus $\mathscr{H}_1: \theta = 2$.

(c) Does there exist a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta \le 1$ versus
$\mathscr{H}_1: \theta > 1$? If so, what is it?

(d) In testing $\mathscr{H}_0: \theta = 1$ versus $\mathscr{H}_1: \theta = 2$, among all simple likelihood-ratio
tests find that test which minimizes the sum of the sizes of the Type I and Type
II errors. You may take n = 1.

12 Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution over the interval
$(\theta, \theta + 1)$. To test $\mathscr{H}_0: \theta = 0$ versus $\mathscr{H}_1: \theta > 0$, the following test was used:
Reject $\mathscr{H}_0$ if and only if $Y_n > 1$ or $Y_1 > k$, where $k$ is a constant.

(a) Determine $k$ so that the test will have size $\alpha$.

(b) Find the power function of the test you obtained in part (a).

(c) Prove or disprove: If $k$ is selected so that the test has size $\alpha$, then the given
test is uniformly most powerful of size $\alpha$.

13 Let $X_1, \ldots, X_m$ be a random sample from the density $\theta_1 x^{\theta_1 - 1} I_{(0, 1)}(x)$, and let
$Y_1, \ldots, Y_n$ be a random sample from the density $\theta_2 y^{\theta_2 - 1} I_{(0, 1)}(y)$. Assume that
the samples are independent. Set $U_i = -\log_e X_i$, $i = 1, \ldots, m$, and
$V_j = -\log_e Y_j$, $j = 1, \ldots, n$.

(a) Find the generalized likelihood-ratio for testing $\mathscr{H}_0: \theta_1 = \theta_2$ versus
$\mathscr{H}_1: \theta_1 \ne \theta_2$.

(b) Show that the generalized likelihood-ratio test can be expressed in terms of
the statistic

$$T = \frac{\sum U_i}{\sum U_i + \sum V_j}.$$

(c) If $\mathscr{H}_0$ is true, what is the distribution of $T$? (You do not have to derive it if
you know the answer.) Does the distribution of $T$ depend on $\theta = \theta_1 = \theta_2$
given that $\mathscr{H}_0$ is true?

14 Find a generalized likelihood-ratio test of size $\alpha$ for testing $\mathscr{H}_0: \theta \le 1$ versus
$\mathscr{H}_1: \theta > 1$ on the basis of a random sample $X_1, \ldots, X_n$ from
$f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$.

15 Let $X$ be a single observation from the density $f(x; \theta) = (1 + \theta)x^{\theta} I_{(0, 1)}(x)$, where
$\theta > -1$.

(a) Find the most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta = 0$ versus $\mathscr{H}_1: \theta = 1$.

(b) Is there a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta \le 0$ versus $\mathscr{H}_1: \theta > 0$?
If so, what is it?

(c) Among all possible simple likelihood-ratio tests of $\mathscr{H}_0: \theta = 0$ versus
$\mathscr{H}_1: \theta = 1$ find a test which minimizes $2\alpha + \beta$, where $\alpha$ and $\beta$ are the respective
sizes of the Type I and Type II errors.

(d) Find a generalized likelihood-ratio test of $\mathscr{H}_0: \theta = 0$ versus $\mathscr{H}_1: \theta \ne 0$.

16 Let $X_1, \ldots, X_m$ be a random sample from $\theta_1 e^{-\theta_1 x} I_{(0, \infty)}(x)$, and let $Y_1, \ldots, Y_n$ be a
random sample from $\theta_2 e^{-\theta_2 y} I_{(0, \infty)}(y)$. Assume that the samples are independent.

(a) Find the generalized likelihood-ratio for testing $\mathscr{H}_0: \theta_1 = \theta_2$ versus
$\mathscr{H}_1: \theta_1 \ne \theta_2$.

(b) Show that the generalized likelihood-ratio test can be expressed in terms of
the statistic $T = \sum X_i / (\sum X_i + \sum Y_j)$. Argue (or show) that the distribution
of $T$ does not depend on $\theta = \theta_1 = \theta_2$ when $\mathscr{H}_0$ is true.

17 Use the confidence-interval technique to derive a test of $\mathscr{H}_0: \mu_1 = \mu_2$ versus
$\mathscr{H}_1: \mu_1 \ne \mu_2$ in sampling from the bivariate normal distribution. Such a test is
often called a paired t test. (See the last paragraph in Subsec. 3.4 in Chap. VIII.)

18 Given the sample (-.2, -.9, -.6, .1) from a normal population with unit variance,
test whether the population mean is less than 0 at the .05 level (i.e., with probability
.05 of a Type I error). That is, test $\mathscr{H}_0: \mu \le 0$ at the .05 level versus $\mathscr{H}_1: \mu > 0$.

19 Given the sample (-4.4, 4.0, 2.0, -4.8) from a normal population with variance 4
and the sample (6.0, 1.0, 3.2, -.4) from a normal population with variance 5, test
at the .05 level that the means differ by no more than one unit. Plot the power 
function for this test. Plot the ideal power function. 

20 A metallurgist made four determinations of the melting point of manganese:
1269, 1271, 1263, and 1265 degrees centigrade. Test the hypothesis that the mean
$\mu$ of this population is within 5 degrees centigrade of the published value of 1260.
Use $\alpha = .05$. (Assume normality and $\sigma^2 = 5$.)

21 Plot the power function for a test of the null hypothesis $\mathscr{H}_0: -1 \le \mu \le 1$ for a
normal distribution with known variance using sample sizes 1, 4, 16, and 64. (Use
the standard deviation $\sigma$ as the unit of measurement on the $\mu$ axis and .05
probability of Type I error.) Plot the ideal power function.

22 Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a normal density with known
variance. What is the best critical region for testing the null hypothesis that the
mean is 6 against the alternative that the mean is 4?

23 Derive a test of $\mathscr{H}_0: \sigma^2 \le 10$ against $\mathscr{H}_1: \sigma^2 > 10$ for a sample of size $n$ from a
normal population with a mean of 0.

24 In testing between two values $\mu_0$ and $\mu_1$ for the mean of a normal population,
show that the probabilities for both types of error can be made arbitrarily small
by taking a sufficiently large sample.

25 A cigarette manufacturer sent each of two laboratories presumably identical 
samples of tobacco. Each made five determinations of the nicotine content in 
milligrams as follows: (i) 24, 27, 26, 21, and 24 and (ii) 27, 28, 23, 31, and 26.
Were the two laboratories measuring the same thing? (Assume normality and a
common variance.)

26 The metallurgist of Prob. 20, after assessing the magnitude of the various errors
that might accrue in his experimental technique, decided that his measurements 
should have a standard deviation of 2 degrees centigrade or less. Are the data 
consistent with this supposition at the .05 level? (That is, test 

27 Test the hypothesis that the two samples of Prob. 19 came from populations with 
the same variance. Use $\alpha = .05$.

28 The power function for a test that the means of two normal populations are equal
depends on the values of the two means $\mu_1$ and $\mu_2$ and is therefore a surface. But
the value of the function depends only on the difference $\theta = \mu_1 - \mu_2$, so that it
can be adequately represented by a curve, say $\beta(\theta)$. Plot $\beta(\theta)$ when samples of 4
are drawn from one population with variance 2 and samples of 2 are drawn from
another population with variance 3 for tests at the .01 level.

29 Given the samples (1.8, 2.9, 1.4, 1.1) and (5.0, 8.6, 9.2) from normal populations,
test whether the variances are equal at the .05 level. 

30 Given a sample of size 100 with $\bar{X} = 2.7$ and $\sum (X_i - \bar{X})^2 = 225$, test the null
hypothesis $\mu = 3$ and $\sigma^2 = 2.5$ at the .01 level, assuming that the population
is normal.

31 Using the sample of Prob. 30, test the hypothesis that $\mu = \sigma^2$ at the .01 level.

32 Using the sample of Prob. 30, test at the 0.1 level whether the .95 quantile point,
say $\xi = \xi_{.95}$, of the population distribution is 3 relative to alternatives $\xi < 3$.
Recall that $\xi$ is such that $\int_{-\infty}^{\xi} f(x)\, dx = .95$, where $f(x)$ is the population
density; it is, of course, $\mu + 1.645\sigma$ in the present instance where the distribution is
assumed to be normal.

33 A sample of size n is drawn from each of k normal populations with the same 
variance. Derive the generalized likelihood-ratio test for testing the hypothesis 
that the means are all 0. Show that the test is a function of a ratio which has 
the F distribution. 

34 Derive the generalized likelihood-ratio test for testing whether the correlation of 
a bivariate normal distribution is 0.

35 If $X_1, X_2, \ldots, X_n$ are observations from normal populations with known variances
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, how would one test whether their means were all equal?

36 A newspaper in a certain city observed that driving conditions were much improved 
in the city because the number of fatal automobile accidents in the past year was 9 
whereas the average number per year over the past several years was 15. Is it 
possible that conditions were more hazardous than before? Assume that the 
number of accidents in a given year has a Poisson distribution. 

37 Six 1-foot specimens of insulated wire were tested at high voltage for weak spots 
in the insulation. The numbers of such weak spots were found to be 2, 0, 1, 1, 3,
and 2. The manufacturer’s quality standard states that there are less than 120
such defects per 100 feet. Is the batch from which these specimens were taken
worse than the standard at the .05 level? (Use the Poisson distribution.)

38 Consider sampling from the normal distribution with unknown mean and variance:

(a) Find a generalized likelihood-ratio test of $\mathscr{H}_0: \sigma^2 \le \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 > \sigma_0^2$.

(b) Find a generalized likelihood-ratio test of $\mathscr{H}_0: \sigma^2 = \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 \ne \sigma_0^2$.

39 (a) Suppose $(N_1, \ldots, N_k)$ is multinomially distributed with parameters $n$,
$p_1, \ldots, p_k$, where $N_{k+1} = n - N_1 - \cdots - N_k$ and $p_{k+1} = 1 - \sum_{j=1}^{k} p_j$. Theorem
8 states that

$$Q = Q_k = \sum_{j=1}^{k+1} \frac{(N_j - np_j)^2}{np_j}$$

has a limiting chi-square distribution. Find the exact mean and variance
of $Q$.

(b) Let $(N_1, \ldots, N_k)$ be distributed as in part (a). Define

$$Q_k^0 = \sum_{j=1}^{k+1} \frac{(N_j - np_j^0)^2}{np_j^0}.$$

[See Eq. (25).] Find $\mathscr{E}[Q_k^0]$. [See Eq. (26).] Is $\mathscr{E}[Q_k^0]$ for $p_1 = p_1^0, \ldots,$
$p_{k+1} = p_{k+1}^0$ less than or equal to $\mathscr{E}[Q_k^0]$ for arbitrary $p_1, \ldots, p_{k+1}$?

40 A psychiatrist newly employed by a medical clinic remarked at a staff meeting that 
about 40 percent of all chronic headaches were of the psychosomatic variety. His 
disbelieving colleagues mixed some pills of plain flour and water, giving them to 
all such patients on the clinic’s rolls with the story that they were a new headache 
remedy and asking for comments. When the comments were all in they could be 
fairly accurately classified as follows: (i) better than aspirin, 8, (ii) about the same 
as aspirin, 3, (iii) slower than aspirin, 1, and (iv) worthless, 29. While the doctors 
were somewhat surprised by these results, they nevertheless accused the psychiatrist 
of exaggeration. Did they have good grounds? 

41 A die was cast 300 times with the following results: 


Occurrence: 1 2 3 4 5 6 

Frequency: 43 49 56 45 66 41 

Are the data consistent at the .05 level with the hypothesis that the die is true? 

42 Of 64 offspring of a certain cross between guinea pigs, 34 were red, 10 were black, 
and 20 were white. According to the genetic model, these numbers should be in 
the ratio 9/3/4. Are the data consistent with the model at the .05 level? 

43 A prominent baseball player’s batting average dropped from .313 in one year to 
.280 in the following year. He was at bat 374 times during the first year and 268 
times during the second. Is the hypothesis tenable at the .05 level that his hitting 
ability was the same during the two years? 

44 Using the data of Prob. 43, assume that one has a sample of 374 from one Bernoulli 
population and 268 from another. Derive the generalized likelihood-ratio test 
for testing whether the probability of a hit is the same for the two populations. 
How does this test compare with the ordinary test for a 2 x 2 contingency table?

45 The progeny of a certain mating were classified by a physical attribute into three
groups, the numbers being 10, 53, and 46. According to a genetic model the
frequencies should be in the ratios $p^2 / 2p(1 - p) / (1 - p)^2$. Are the data consistent
with the model at the .05 level?

46 A thousand individuals were classified according to sex and according to whether 
or not they were color-blind as follows: 



                 Male    Female

Normal            442       514

Color-blind        38         6


According to the genetic model these numbers should have relative frequencies 
given by 


                 Male    Female

Normal           $p/2$    $p^2/2 + pq$

Color-blind      $q/2$    $q^2/2$

where q = 1 — p is the proportion of color-blind individuals in the population. 
Are the data consistent with the model? 

47 Treating the table of Prob. 46 as a 2 x 2 contingency table, test the hypothesis that 
color blindness is independent of sex. 

48 Gilby classified 1725 school children according to intelligence and apparent family 
economic level. A condensed classification follows: 



                       Dull    Intelligent    Very capable

Very well clothed        81            322             233

Well clothed            141            457             153

Poorly clothed          127            163              48


Test for independence at the .01 level. 

49 A serum supposed to have some effect in preventing colds was tested on 500
individuals, and their records for 1 year were compared with the records of 500
untreated individuals as follows:





               No colds    One cold    More than one cold

Treated             252         145                   103

Untreated           224         136                   140


Test at the .05 level whether the two trinomial populations may be regarded as the
same.

50 According to the genetic model the proportion of individuals having the four
blood types should be given by:

O: $q^2$
A: $p^2 + 2pq$
B: $r^2 + 2qr$
AB: $2pr$

where $p + q + r = 1$. Given the sample O, 374; A, 436; B, 132; AB, 58; how
would you test the correctness of the model?

51 Galton investigated 78 families, classifying children according to whether or not 
they were light-eyed, whether or not they had a light-eyed parent, and whether or 
not they had a light-eyed grandparent. The following 2x2x2 table resulted: 


Grandparent:           Light               Not

Parent:            Light     Not      Light     Not

Child  Light        1928     552        596     508

       Not           303     395        225     501


Test for complete independence at the .01 level. Test whether the child classification
is independent of the other two classifications at the .01 level.

52 Compute the exact distribution of $\Lambda$ for a 2 x 2 contingency table with marginal
totals $N_{1\cdot} = 4$, $N_{2\cdot} = 7$, $N_{\cdot 1} = 6$, $N_{\cdot 2} = 5$. What is the exact probability that
$-2 \log_e \Lambda$ exceeds 3.84, the .05 level of a chi-square distribution for one degree of
freedom?

53 In testing independence in a 2 x 2 contingency table, find the exact distribution of 
the generalized likelihood-ratio for a sample of size 2. Do the same for samples 
of size 3 and 4. Discuss. 

54 Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ is known. Let $\Lambda$
denote the generalized likelihood-ratio for testing $\mathscr{H}_0: \mu = \mu_0$ versus $\mathscr{H}_1: \mu \ne \mu_0$.
Find the exact distribution of $-2 \log_e \Lambda$, and compare it with the corresponding
asymptotic distribution when $\mathscr{H}_0$ is true. Hint: $\sum (X_i - \bar{X})^2 = \sum (X_i - \mu)^2 - n(\bar{X} - \mu)^2$.

55 Here is an actual sequence of outcomes for independent Bernoulli trials. Do you
think $p$ (the probability of success) equals 1/2?


* fffs, S fs //, J ////, ffsfs , 5 ////, 
Sffsf ffffs, ffsff, s ////, ssfff.

If you do not think $p$ is 1/2, what do you think $p$ is? Give a confidence-interval
estimate of $p$. If the above data were generated by tossing two dice, then what
would you think $p$ is? If the data were generated by tossing two coins, then what
would you think $p$ is? (If the data were generated by tossing two dice, assume that
the possible values of $p$ are $j/36$, $j = 0, \ldots, 36$. If the data were generated by
tossing two coins, assume that the possible values of $p$ are $j/4$, $j = 0, \ldots, 4$.)

56 In sampling from a Bernoulli distribution, test the null hypothesis that $p = 1/4$
against the alternative that $p = 1/3$. Let $p$ refer to the probability of two heads
when tossing two coins, and carry through the test by tossing two coins, using
$\alpha = \beta = .10$. (The alternative was obtained by reasoning that tossing two coins
can result in the three outcomes: two heads, two tails, or one head and one tail,
and then assuming each of the three outcomes equally likely.)

57 Show that the SPRT (sequential probability ratio test) of $\mu = \mu_0$ versus $\mu = \mu_1$ for
the mean of the normal distribution with known variance may be performed by
plotting the two lines

$$y = \frac{\sigma^2}{\mu_1 - \mu_0} \log_e k_0 + \frac{\mu_0 + \mu_1}{2}\, n$$

and

$$y = \frac{\sigma^2}{\mu_1 - \mu_0} \log_e k_1 + \frac{\mu_0 + \mu_1}{2}\, n$$

in the $ny$ plane and then plotting $\sum_{i=1}^{n} x_i$ against $n$ as the observations are made.
The test ends when one of the lines is crossed.

58 Consider sampling from $f(x; \theta) = (1/\theta) I_{(0, \theta)}(x)$, $\theta > 0$. Discuss the sequential
probability ratio test of $\theta = \theta_0$ versus $\theta = \theta_1$ with $\theta_0 < \theta_1$.

59 Let $X_1, X_2, \ldots, X_n, \ldots$ be independent random variables all having the same
Bernoulli distribution given by $P[X_n = 1] = \theta = 1 - P[X_n = 0]$. To test $\mathscr{H}_0: \theta = 1/4$
versus $\mathscr{H}_1: \theta = 3/4$, the following sequential test was used: Continue sampling as
long as $n/2 - 2 < \sum X_i < n/2 + 2$; if and when $\sum X_i$ is first less than or equal to
$n/2 - 2$, accept $\mathscr{H}_0$; and if and when $\sum X_i$ is first greater than or equal to $n/2 + 2$,
accept $\mathscr{H}_1$. Is this test a SPRT?

60 Assume that $X$ has a Poisson distribution with mean $\theta$. Consider testing
$\mathscr{H}_0: \theta = 1$ versus $\mathscr{H}_1: \theta = 2$. Fix $\alpha = \beta = .05$.

(a) Find the fixed sample size necessary to achieve the prescribed error sizes.

(b) Derive the (approximate) sequential probability ratio test, and show that it
can be based on the statistics $\sum_{i=1}^{n} X_i$, $n = 1, 2, \ldots$.

(c) Find the approximate expected sample sizes for the sequential probability
ratio test.


1 INTRODUCTION AND SUMMARY

The purpose of this chapter is to discuss a special case of the linear statistical
model. There is a large amount of material available on this subject, but we
will discuss only the special case of the simple linear model. Some authors
refer to this as the theory of straight-line regression. Studying this model
requires some of the theory in previous chapters, such as distribution
theory, point and interval estimation concepts, and some material from hypothesis
testing. This chapter will demonstrate how these concepts can be applied
to a situation, the simple linear model, that is important in applied statistics.

In Sec. 2 two examples are given to illustrate how the simple linear model 
can be used to simulate real-world problems. In Sec. 3 the simple linear model 
will be rigorously defined and put into a framework that will allow us to study it 
by using statistical procedures from previous chapters. In the remaining 
sections discussion will be centered around point estimation, interval estimation, 
and testing hypotheses on the parameters in the model under two different 
assumptions about the distributions of the random variables in the model.


2 EXAMPLES OF THE LINEAR MODEL

In this section we will give two examples to illustrate how the linear model arises 
in applied problems. 


EXAMPLE 1 The distance $s$ that a particle travels in time $t$ is given by the
formula $s = \beta_0 + \beta_1 t$, where $\beta_1$ is the average speed and $\beta_0$ is the
position at time $t = 0$. If $\beta_0$ and $\beta_1$ are unknown, then $s$ can be observed
for two distinct values of $t$ and the resulting two equations solved for $\beta_0$
and $\beta_1$. For example, suppose that $s$ is observed to be 2 when $t = 1$, and
$s$ is 11 when $t = 4$. This gives $2 = \beta_0 + \beta_1$ and $11 = \beta_0 + 4\beta_1$, and the
solution is $\beta_0 = -1$, $\beta_1 = 3$; so $s = -1 + 3t$. Suppose that for some
reason the distance cannot be observed accurately, but there is a measurement
error which is of a random nature. Therefore $s$ cannot be observed,
but suppose that we can observe $Y$, where $Y = s + E$ and $E$ is a random
error whose mean is 0. Substituting for $s$ gives us

$$Y = \beta_0 + \beta_1 t + E,$$ (1)

where $Y$ is an observable random variable, $t$ is an observable nonrandom
variable, $E$ is an unobservable random variable, and $\beta_0$ and $\beta_1$ are unknown
parameters. We cannot solve for $\beta_0$ and $\beta_1$ by observing two sets of
values of $Y$ and $t$, as we did with $s$ and $t$ above, since there is no functional
relationship between $Y$ and $t$. The objective in this model is to find $\beta_0$
and $\beta_1$ and hence evaluate $s = \beta_0 + \beta_1 t$ for various values of $t$. Since $s$
is subject to errors and cannot be observed, we cannot know $\beta_0$ and $\beta_1$,
but by observing various sets of $Y$ and $t$ values statistical methods can be
used to obtain estimates of $\beta_0$, $\beta_1$, and $s$. This type of model is a
functional-relationship model with a measurement error. ////
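
The error-free calculation in Example 1 can be checked numerically. The following is only an illustrative sketch; the helper name `solve_line` is hypothetical and does not appear in the text.

```python
def solve_line(t1, s1, t2, s2):
    """Solve s = b0 + b1*t exactly from two error-free observations (t, s)."""
    if t1 == t2:
        raise ValueError("the two values of t must be distinct")
    b1 = (s2 - s1) / (t2 - t1)   # slope: the average speed
    b0 = s1 - b1 * t1            # intercept: the position at t = 0
    return b0, b1

# Observations from Example 1: s = 2 when t = 1, and s = 11 when t = 4.
b0, b1 = solve_line(1, 2, 4, 11)
print(b0, b1)  # -1.0 3.0
```

Once a random error is added to each observation, two points no longer determine the line, which is why the remainder of the chapter turns to statistical estimation.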


EXAMPLE 2 For another example, consider the relationship between the
height $h$ and weight $w$ of individuals in a certain city. Certainly there is no
functional relationship between $w$ and $h$, but there does seem to be some
kind of relation. We shall consider them as random variables and shall
postulate that $(W, H)$ has a bivariate normal distribution. Then the
expected value of $H$ for a given value $w$ of $W$ is given by

$$\mathscr{E}[H \mid W = w] = \beta_0 + \beta_1 w,$$ (2)

where $\beta_0$ and $\beta_1$ are functions of the parameters in a bivariate normal
density. Although there is no functional relationship between $H$ and $W$, if
they are assumed jointly normal, there is a linear functional relationship
between the weights and the average value of the heights. Thus we can
write the following: $H$ and $W$ are jointly normal, and

$$\mathscr{E}[H \mid W = w] = \beta_0 + \beta_1 w;$$

or we can write

$$H = \beta_0 + \beta_1 w + E,$$

where $E$ is a normally distributed random variable denoting error. This
is a regression model, and although it came from a somewhat different
problem than the functional relationship in Example 1, they both are
special cases of a linear statistical model, which will be discussed in this
chapter. ////


3 DEFINITION OF LINEAR MODEL 

Let $g(\cdot)$ be a linear function of a real variable $x$. This is defined by
$g(x) = \beta_0 + \beta_1 x$, where $x$ is in a domain $D$. Quite often $D$ will be the entire real line,
a half line, or a bounded interval on the real line. To model the situations
referred to in Examples 1 and 2 above, we assume that there exists a family of
c.d.f.’s (one c.d.f. for each $x$ in $D$) such that the mean of the c.d.f. corresponding
to a given $x$ (say $x_0$) in $D$ is $\beta_0 + \beta_1 x_0$. Thus the means of the c.d.f.’s are on the
line defined by $g(x) = \beta_0 + \beta_1 x$. See Fig. 1. The objective is to sample some
of the c.d.f.’s and on the basis of the sample to make statistical inferences about
$\beta_0$ and $\beta_1$.

The sampling is accomplished as follows:

(i) A set of $n$ $x$’s in $D$ is observed and denoted by $x_1, x_2, \ldots, x_n$.
The $x$’s are not random variables, but they may be selected either by
some random procedure or by purposeful selection.

(ii) Each $x_i$ determines a c.d.f. whose mean is $\beta_0 + \beta_1 x_i$ and
whose variance is $\sigma^2$. From this c.d.f. a value is selected at random and
denoted by $Y_i$. ($Y_i$ is a shortened notation for $Y_{x_i}$.)

Thus we have a set of $n$ pairs of observations, which we denote by $(Y_1, x_1)$,
$(Y_2, x_2)$, ..., $(Y_n, x_n)$. We have assumed that

$$\mathscr{E}[Y_i] = \beta_0 + \beta_1 x_i$$

and

$$\text{var}[Y_i] = \sigma^2.$$

Therefore we can define random variables $E_1, E_2, \ldots, E_n$ by

$$E_i = Y_i - \beta_0 - \beta_1 x_i \quad \text{for } i = 1, 2, \ldots, n,$$

and the $E_i$ satisfy

$$\mathscr{E}[E_i] = 0$$

and

$$\text{var}[E_i] = \sigma^2.$$

So we can write

$$Y_i = \beta_0 + \beta_1 x_i + E_i \quad \text{for } i = 1, 2, \ldots, n,$$

where

$$\mathscr{E}[E_i] = 0 \quad \text{and} \quad \text{var}[E_i] = \sigma^2,$$

and this defines a linear model. We summarize these ideas below.

Definition 1 Linear model Let the function $\mu(\cdot)$ be defined by
$\mu(x) = \beta_0 + \beta_1 x$ for all $x$ in a set $D$. For each $x$ in $D$ let $F_{Y_x}(\cdot)$ be a
c.d.f. with a mean equal to $\mu(x)$, that is, $\beta_0 + \beta_1 x$, and variance $\sigma^2$. Let
$x_1, x_2, \ldots, x_n$ be an observed set of $n$ $x$’s from $D$. For $x_i$ let $Y_i$ be a random
sample of size 1 from the c.d.f. $F_{Y_{x_i}}(\cdot)$ for $i = 1, 2, \ldots, n$. Then $(Y_1, x_1)$,
$(Y_2, x_2)$, ..., $(Y_n, x_n)$ is a set of $n$ observations related by

$$\mathscr{E}[Y_i] = \beta_0 + \beta_1 x_i$$

and (3)

$$\text{var}[Y_i] = \sigma^2, \quad i = 1, 2, \ldots, n.$$

These specifications define a linear statistical model. ////

Note We can write Eq. (3) as

$$Y_i = \beta_0 + \beta_1 x_i + E_i$$

$$\mathscr{E}[E_i] = 0$$ (4)

$$\text{var}[E_i] = \sigma^2,$$

where $i = 1, 2, \ldots, n$. ////
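
The sampling scheme behind Eq. (4) can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `simulate_linear_model` and the parameter values are hypothetical, and normal errors are assumed for convenience even though the definition requires only mean 0 and variance $\sigma^2$.

```python
import random

def simulate_linear_model(xs, beta0, beta1, sigma, rng=None):
    """Draw one Y_i for each fixed x_i from Y_i = beta0 + beta1*x_i + E_i."""
    rng = rng or random.Random()
    # Assumption of this sketch: E_i is normal with mean 0 and standard
    # deviation sigma. The definition only fixes the mean and the variance.
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]            # the x's are fixed, not random
ys = simulate_linear_model(xs, beta0=-1.0, beta1=3.0, sigma=0.5,
                           rng=random.Random(0))
pairs = list(zip(ys, xs))            # the n pairs (Y_i, x_i)
```

Setting `sigma` to 0 reproduces the error-free line exactly, which is a convenient sanity check.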


Note The word “linear” in “linear statistical model” refers to the fact
that the function $\mu(\cdot)$ is linear in the unknown parameters. In the
simple example we have referred to, $\mu(\cdot)$ is defined by $\mu(x) = \beta_0 + \beta_1 x$; $x$
in $D$, and this is linear in $x$, but this is not an essential part of the definition
of this linear model. For example, $Y = \mu(x) + E$, where $\mu(x) = \beta_0 + \beta_1 e^x$,
is a linear statistical model. ////
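
The distinction between linearity in the parameters and linearity in $x$ can be checked numerically for the example $\mu(x) = \beta_0 + \beta_1 e^x$. The sketch below is illustrative only; the function name `mu` and the numerical values are arbitrary choices, not taken from the text.

```python
import math

def mu(x, beta0, beta1):
    """Mean function beta0 + beta1*e^x: nonlinear in x, linear in (beta0, beta1)."""
    return beta0 + beta1 * math.exp(x)

x = 1.3
# Linear in the parameters: combining parameter vectors first gives the same
# value as combining the corresponding means afterwards.
lhs = mu(x, 2 * 1.0 + 3 * 4.0, 2 * 0.5 + 3 * (-2.0))
rhs = 2 * mu(x, 1.0, 0.5) + 3 * mu(x, 4.0, -2.0)
linear_in_parameters = math.isclose(lhs, rhs)

# Not linear in x: doubling x does not double the mean.
linear_in_x = math.isclose(mu(2 * x, 1.0, 0.5), 2 * mu(x, 1.0, 0.5))
print(linear_in_parameters, linear_in_x)  # True False
```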

Note In many situations some additional assumptions on the c.d.f.
$F_{Y_x}(\cdot)$ will be made, such as normality. Also, generally the sampling
procedure will be such that the $Y_i$ will be either jointly independent or
pairwise uncorrelated. In fact we shall discuss inference procedures for
two sets of assumptions on the random variables defined in Cases A and B
below. ////

Case A For this case we assume that the $n$ random variables $Y_1, \ldots, Y_n$ are jointly
independent and each $Y_i$ is a normal random variable. ////

Case B For this case we assume only that the $Y_i$ are pairwise uncorrelated;
that is, $\text{cov}[Y_i, Y_j] = 0$ for all $i \ne j = 1, 2, \ldots, n$. ////

For Case A we shall discuss the following:

(i) Point estimation of $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any $x$ in $D$

(ii) Confidence interval for $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any $x$ in $D$

(iii) Tests of hypotheses on $\beta_0$, $\beta_1$, and $\sigma^2$

For Case B we shall discuss the following:

(iv) Point estimation of $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any $x$ in $D$


4 POINT ESTIMATION—CASE A

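For this case $Y_1, Y_2, \ldots, Y_n$ are independent normal random variables with
means $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_2$, ..., $\beta_0 + \beta_1 x_n$ and variances $\sigma^2$. To find point
estimators, we shall use the method of maximum likelihood. The likelihood
function is

$$L(\beta_0, \beta_1, \sigma^2) = L(\beta_0, \beta_1, \sigma^2; y_1, y_2, \ldots, y_n)
  = \prod_{i=1}^{n} \left(\frac{1}{2\pi\sigma^2}\right)^{1/2} \exp\left[-\frac{1}{2}\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)^2\right]$$ (5)

and

$$\log L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$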


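The partial derivatives of $\log L(\beta_0, \beta_1, \sigma^2)$ with respect to $\beta_0$, $\beta_1$, and $\sigma^2$ are
obtained and set equal to 0. We let $\hat{\beta}_0$, $\hat{\beta}_1$, $\tilde{\sigma}^2$ denote the solutions of the
resulting three equations. The three equations are given below (with some
minor simplifications):

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i = 0$$ (6)

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = n\tilde{\sigma}^2$$

The first two equations are called the normal equations for determining $\beta_0$ and
$\beta_1$. They are linear in $\hat{\beta}_0$ and $\hat{\beta}_1$ and are readily solved. We obtain

$$\hat{\beta}_1 = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$$ (7)

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$ (8)

$$\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$ (9)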


These are maximum-likelihood estimates of $\beta_1$, $\beta_0$, and $\sigma^2$, respectively. We
notice that the $x_i$’s must be such that $\sum (x_i - \bar{x})^2 \ne 0$; that is, there must be at
least two distinct values for the $x_i$.
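
The closed-form solutions in Eqs. (7) to (9) translate directly into code. The sketch below is only an illustration: the function name `mle_simple_linear` is hypothetical and the data are made up to exercise the formulas.

```python
def mle_simple_linear(xs, ys):
    """Return (beta0_hat, beta1_hat, sigma2_mle) as in Eqs. (7) to (9)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    beta1 = sum((y - ybar) * (x - xbar) for x, y in zip(xs, ys)) / sxx      # Eq. (7)
    beta0 = ybar - beta1 * xbar                                             # Eq. (8)
    sigma2 = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys)) / n  # Eq. (9)
    return beta0, beta1, sigma2

# Hypothetical data, chosen only to illustrate the computation.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.8]
b0, b1, s2 = mle_simple_linear(xs, ys)
```

For these data the estimates are approximately -0.85, 2.94, and 0.0205, and the residuals sum to zero, as the first normal equation requires.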
Note that since

$$f_{Y_i}(y_i; \beta_0, \beta_1, \sigma) = (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right]$$

$$= (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(\beta_0 + \beta_1 x_i)^2\right]
    \times \exp\left(-\frac{1}{2\sigma^2} y_i^2 + \frac{\beta_0}{\sigma^2} y_i + \frac{\beta_1}{\sigma^2} x_i y_i\right),$$

$f_{Y_i}(y_i; \beta_0, \beta_1, \sigma)$ is a member of a three-parameter exponential family; hence,
by a generalization of Theorem 16 of Chap. VII

$$\sum_{i=1}^{n} Y_i^2, \quad \sum_{i=1}^{n} Y_i, \quad \sum_{i=1}^{n} x_i Y_i$$ (10)

is a set of minimal sufficient and jointly complete statistics. Furthermore,
since the set of statistics given in Eq. (10) is a one-to-one transformation of the
estimators (statistics) defined by Eqs. (7) to (9), the estimators are themselves
minimal sufficient and jointly complete.
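
The one-to-one relation between the statistics in Eq. (10) and the estimates in Eqs. (7) to (9) can be made concrete: given the fixed $x_i$, the three sums alone determine the estimates. The sketch below is illustrative; the function name and the data are hypothetical, and it uses the identity $\sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum y_i^2 - (\sum y_i)^2/n - \hat{\beta}_1^2 \sum (x_i - \bar{x})^2$.

```python
def estimates_from_sufficient_stats(xs, sum_y2, sum_y, sum_xy):
    """Recover (beta0_hat, beta1_hat, sigma2_mle) from the sums in Eq. (10)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    beta1 = (sum_xy - xbar * sum_y) / sxx          # numerator equals sum (y - ybar)(x - xbar)
    beta0 = sum_y / n - beta1 * xbar
    # Residual sum of squares written in terms of the three sums only.
    rss = sum_y2 - sum_y ** 2 / n - beta1 ** 2 * sxx
    return beta0, beta1, rss / n

# Hypothetical data; only the three sums are passed to the function.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.8]
t1 = sum(y * y for y in ys)
t2 = sum(ys)
t3 = sum(x * y for x, y in zip(xs, ys))
b0, b1, s2 = estimates_from_sufficient_stats(xs, t1, t2, t3)
```

The result agrees with a direct evaluation of Eqs. (7) to (9) on the same data, up to rounding error.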

To further examine the properties that the estimators possess, we shall
find the joint distribution of statistics corresponding to $\hat{\beta}_0$, $\hat{\beta}_1$, $\tilde{\sigma}^2$. To do this,
we shall first find the moment generating function of $\hat{\Theta}_1$, $\hat{\Theta}_2$, and $\hat{\Theta}_3$, which
are random quantities with values defined by




( 11 ) 


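By Definition 25 of Chap. IV the joint moment generating function of $\hat{\Theta}_1$, $\hat{\Theta}_2$,
$\hat{\Theta}_3$ is defined to be

$$m(t_1, t_2, t_3) = \mathscr{E}[\exp(t_1\hat{\Theta}_1 + t_2\hat{\Theta}_2 + t_3\hat{\Theta}_3)]$$

if the expectation exists for $-h < t_i < h$ for some $h > 0$. We obtain

$$m(t_1, t_2, t_3) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp(t_1\hat{\theta}_1 + t_2\hat{\theta}_2 + t_3\hat{\theta}_3)\,
  \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right] dy_1 \cdots dy_n,$$

where in the integral the quantities $\hat{\theta}_1$, $\hat{\theta}_2$, $\hat{\theta}_3$ will be written in terms of $y_i$ and $x_i$.
This integral is straightforward but tedious to evaluate, and the result is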


m(,„ = (exp j{,; 







From this moment generating function we can learn a number of things: 


(i) It factors into a function of $t_1$ and $t_2$ only times a function of $t_3$
only. We write this result as $m(t_1, t_2, t_3) = m_1(t_1, t_2)m_2(t_3)$. By using
Theorem 10 of Chap. IV we know that the random variables associated
with $t_1$ and $t_2$ are independent of the random variable associated with
$t_3$; that is, $\hat{\Theta}_1$ and $\hat{\Theta}_2$ are independent of $\hat{\Theta}_3$, which implies that the
maximum-likelihood estimators of $\beta_0$ and $\beta_1$ are jointly independent of the
maximum-likelihood estimator of $\sigma^2$.

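(ii) Since by a generalization of Theorem 7 in Chap. II a moment
generating function uniquely determines the distribution of the random
variables involved, we shall try to recognize the form of $m_1(t_1, t_2)$ and the
form of $m_2(t_3)$. We note by Theorem 12 of Chap. IV that $m_1(t_1, t_2)$ is the
moment generating function of a bivariate normal distribution, and, of
course, we obtain the means, variances, and covariance. We see that
the random variables, say $\hat{B}_0$ and $\hat{B}_1$, associated with $\hat{\beta}_0$ and $\hat{\beta}_1$ are bivariate
normal random variables with means $(\beta_0, \beta_1)$ and covariance matrix

$$\begin{bmatrix}
\dfrac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2} & \dfrac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2} \\[2ex]
\dfrac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2} & \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}
\end{bmatrix}.$$

Another way to state this is the following: $(\hat{B}_0, \hat{B}_1)$ is a bivariate normal
random variable with parameters

$$\mathscr{E}[\hat{B}_0] = \beta_0, \qquad \mathscr{E}[\hat{B}_1] = \beta_1,$$

$$\text{var}[\hat{B}_0] = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2}, \qquad
  \text{var}[\hat{B}_1] = \frac{\sigma^2}{\sum (x_i - \bar{x})^2},$$ (12)

and

$$\text{cov}[\hat{B}_0, \hat{B}_1] = \frac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2}.$$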


(iii) We recognize that $m_2(t_3)$ is the moment generating function
of a chi-square random variable with $n - 2$ degrees of freedom. Hence
we have

$$\frac{n\tilde{\Sigma}^2}{\sigma^2} = \frac{1}{\sigma^2}\sum (Y_i - \hat{B}_0 - \hat{B}_1 x_i)^2,$$

which is distributed as a chi-square distribution with $n - 2$ degrees of
freedom. (Here, and in the rest of this chapter, $\tilde{\Sigma}^2$ and $\hat{\Sigma}^2$ are used to
denote the random variables with values $\tilde{\sigma}^2$ and $\hat{\sigma}^2$ respectively.) By
Eq. (22) of Chap. VI we get

$$\mathscr{E}\left[\frac{n\tilde{\Sigma}^2}{\sigma^2}\right] = n - 2,$$

so we define $\hat{\sigma}^2$ by

$$\hat{\sigma}^2 = \frac{n\tilde{\sigma}^2}{n - 2} = \frac{1}{n - 2}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
We shall summarize these results in the following theorem. 


Theorem 1 Consider Case A of the simple linear model given in
Definition 1. The maximum-likelihood estimators of $\beta_1$, $\beta_0$, and $\sigma^2$
(corrected for bias) are given by

$$\hat{B}_1 = \frac{\sum (Y_i - \bar{Y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}, \qquad
  \hat{B}_0 = \bar{Y} - \hat{B}_1 \bar{x}, \qquad
  \hat{\Sigma}^2 = \frac{1}{n - 2}\sum_{i=1}^{n} (Y_i - \hat{B}_0 - \hat{B}_1 x_i)^2.$$ (13)

These estimators satisfy the following:

(i) They are jointly complete sufficient statistics.

(ii) They are unbiased estimators of their respective parameters.

(iii) $(\hat{B}_0, \hat{B}_1)$ is independent of $\hat{\Sigma}^2$.

(iv) $(\hat{B}_0, \hat{B}_1)$ has a bivariate normal distribution with mean $(\beta_0, \beta_1)$
and covariance matrix given by Eq. (12).

(v) $(n - 2)\hat{\Sigma}^2/\sigma^2$ is a chi-square random variable with $n - 2$
degrees of freedom. ////


In Chap. VII, we noted that maximum-likelihood estimators possess a
number of good properties, but they, in general, are not minimum-variance
unbiased estimators. We now employ a minor generalization of Theorem 17
of Chap. VII along with the results of Theorem 1 above, to state a strong
optimal property about the estimators $\hat{B}_0$, $\hat{B}_1$, $\hat{\Sigma}^2$ of $\beta_0$, $\beta_1$, $\sigma^2$.

Theorem 2 Consider the simple linear model given in Definition 1. Let
$\tau(\beta_0, \beta_1, \sigma^2)$ be any known function of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$
for which an unbiased estimator exists. Then there exists an unbiased
estimator of $\tau(\beta_0, \beta_1, \sigma^2)$ that is a function of $\hat{B}_0$, $\hat{B}_1$, and $\hat{\Sigma}^2$. We denote
this estimator by $t(\hat{B}_0, \hat{B}_1, \hat{\Sigma}^2)$, and it is the UMVUE of $\tau(\beta_0, \beta_1, \sigma^2)$.

Proof This result follows from a generalization of Theorem 17
of Chap. VII, since $\hat{B}_0$, $\hat{B}_1$, $\hat{\Sigma}^2$ is a set of sufficient complete statistics. ////

Corollary The UMVUE of each of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$ is
given by $\hat{B}_0$, $\hat{B}_1$, and $\hat{\Sigma}^2$, respectively, in Theorem 1. ////

Corollary The UMVUE of $\mu(x) = \beta_0 + \beta_1 x$ for any $x$ in the domain
$D$ is $\hat{M}(x)$, where $\hat{M}(x) = \hat{B}_0 + \hat{B}_1 x$. ($\hat{M}(x)$ is the random variable with
values $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$.) ////

Corollary For any two known constants $c_1$ and $c_2$ the UMVUE of
$c_1\beta_0 + c_2\beta_1$ is $c_1\hat{B}_0 + c_2\hat{B}_1$. ////


5 CONFIDENCE INTERVALS—CASE A 


To obtain a $\gamma$-level confidence interval on $\sigma^2$, we note by Theorem 1 that

$$U = \frac{(n-2)\hat\sigma^2}{\sigma^2}$$

is distributed as a chi-square random variable with $n-2$ d.f. (degrees of
freedom). Hence $U$ is a pivotal quantity, and we get

$$P[\chi^2_{(1-\gamma)/2}(n-2) \le U \le \chi^2_{(1+\gamma)/2}(n-2)] = \gamma.$$

If we substitute for $U$ and simplify, we get

$$P\left[\frac{(n-2)\hat\sigma^2}{\chi^2_{(1+\gamma)/2}(n-2)} \le \sigma^2 \le \frac{(n-2)\hat\sigma^2}{\chi^2_{(1-\gamma)/2}(n-2)}\right] = \gamma,$$  (14)

and this is a $100\gamma$ percent confidence interval on $\sigma^2$.
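As a sketch of how Eq. (14) is evaluated in practice, the function below takes the two chi-square quantiles as arguments, on the assumption that they are read from a table. The name `sigma2_interval` and the numbers in the example are hypothetical, not from the text.

```python
def sigma2_interval(s2, n, chi2_lower, chi2_upper):
    """Confidence interval (14) for sigma^2.

    s2         : bias-corrected variance estimate from Eq. (13)
    n          : number of observations
    chi2_lower : (1 - gamma)/2 quantile of chi-square with n - 2 d.f.
    chi2_upper : (1 + gamma)/2 quantile of chi-square with n - 2 d.f.
    """
    if n < 3 or not 0 < chi2_lower < chi2_upper:
        raise ValueError("need n >= 3 and 0 < chi2_lower < chi2_upper")
    df = n - 2
    # The larger quantile gives the lower endpoint, and vice versa.
    return df * s2 / chi2_upper, df * s2 / chi2_lower


# Hypothetical numbers: n = 10, s2 = 0.15, gamma = .95. The values 2.180 and
# 17.535 are the .025 and .975 quantiles of chi-square with 8 d.f. as given
# in standard tables.
low, high = sigma2_interval(0.15, 10, 2.180, 17.535)
print(low, high)
```

The interval is not symmetric about the point estimate, because the chi-square distribution is skewed.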

To obtain a $\gamma$-level confidence interval on $\beta_0$, we note that by Theorem 1:

(i) $Z = (\hat B_0 - \beta_0)\sqrt{n\sum (x_i - \bar x)^2 / (\sigma^2 \sum x_i^2)}$ is distributed as a
standard normal random variable.

(ii) $(n-2)\hat\sigma^2/\sigma^2 = U$ is distributed as a chi-square random variable
with $n-2$ d.f.

(iii) $Z$ and $U$ are independent.

Hence, by Theorem 10 of Chap. VI

$$T = \frac{Z}{\sqrt{U/(n-2)}} = \frac{\hat B_0 - \beta_0}{\hat\sigma} \sqrt{\frac{n\sum (x_i - \bar x)^2}{\sum x_i^2}}$$

is distributed as Student's t distribution with $n-2$ d.f. Hence $T$ is a pivotal
quantity. We get

$$P[-t_{(1+\gamma)/2}(n-2) \le T \le t_{(1+\gamma)/2}(n-2)] = \gamma,$$

and if we substitute for $T$, we get

$$P\left[-t_{(1+\gamma)/2}(n-2) \le \frac{\hat B_0 - \beta_0}{\hat\sigma} \sqrt{\frac{n\sum (x_i - \bar x)^2}{\sum x_i^2}} \le t_{(1+\gamma)/2}(n-2)\right] = \gamma.$$

After simplifying we get the following for a $100\gamma$ percent confidence interval
on $\beta_0$:

$$P\left[\hat B_0 - t_{(1+\gamma)/2}(n-2)\,\hat\sigma \sqrt{\frac{\sum x_i^2}{n\sum (x_i - \bar x)^2}} \le \beta_0 \le \hat B_0 + t_{(1+\gamma)/2}(n-2)\,\hat\sigma \sqrt{\frac{\sum x_i^2}{n\sum (x_i - \bar x)^2}}\right] = \gamma.$$

We note that by Theorem 1 we get

$$\mathrm{var}[\hat B_0] = \sigma^2 \frac{\sum x_i^2}{n\sum (x_i - \bar x)^2},$$

and the estimated variance of $\hat B_0$, which we write as $\widehat{\mathrm{var}}[\hat B_0]$, is given by

$$\widehat{\mathrm{var}}[\hat B_0] = \hat\sigma^2 \frac{\sum x_i^2}{n\sum (x_i - \bar x)^2}.$$

Then the confidence statement can be written as

$$P\left[\hat B_0 - t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat B_0]} \le \beta_0 \le \hat B_0 + t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat B_0]}\right] = \gamma.$$  (15)

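To obtain a $\gamma$-level confidence interval on $\beta_1$, we note that by Theorem 1:

(i) $Z = (\hat B_1 - \beta_1)\sqrt{\sum (x_i - \bar x)^2 / \sigma^2}$ is distributed as a standard
normal random variable.

(ii) $U = (n-2)\hat\sigma^2/\sigma^2$ is distributed as a chi-square random variable
with $n-2$ d.f.

(iii) $Z$ and $U$ are independent.

Hence by Theorem 10 of Chap. VI

$$T = \frac{Z}{\sqrt{U/(n-2)}} = \frac{(\hat B_1 - \beta_1)\sqrt{\sum (x_i - \bar x)^2}}{\hat\sigma}$$

is distributed as Student's t distribution with $n-2$ d.f. Hence $T$ is a pivotal
quantity, and we get

$$P[-t_{(1+\gamma)/2}(n-2) \le T \le t_{(1+\gamma)/2}(n-2)] = \gamma.$$

If we substitute for $T$, we obtain

$$P\left[-t_{(1+\gamma)/2}(n-2) \le \frac{(\hat B_1 - \beta_1)\sqrt{\sum (x_i - \bar x)^2}}{\hat\sigma} \le t_{(1+\gamma)/2}(n-2)\right] = \gamma.$$

After some simplification we get

$$P\left[\hat B_1 - t_{(1+\gamma)/2}(n-2)\sqrt{\frac{\hat\sigma^2}{\sum (x_i - \bar x)^2}} \le \beta_1 \le \hat B_1 + t_{(1+\gamma)/2}(n-2)\sqrt{\frac{\hat\sigma^2}{\sum (x_i - \bar x)^2}}\right] = \gamma,$$

and this is a $100\gamma$ percent confidence interval on $\beta_1$. We note from Theorem 1
that

$$\mathrm{var}[\hat B_1] = \frac{\sigma^2}{\sum (x_i - \bar x)^2}$$

and that the estimated variance of $\hat B_1$, which is denoted by $\widehat{\mathrm{var}}[\hat B_1]$, is given by

$$\widehat{\mathrm{var}}[\hat B_1] = \frac{\hat\sigma^2}{\sum (x_i - \bar x)^2}.$$

If we substitute this into the confidence-interval statement, we obtain

$$P\left[\hat B_1 - t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat B_1]} \le \beta_1 \le \hat B_1 + t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat B_1]}\right] = \gamma.$$  (16)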
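The intervals of Eqs. (15) and (16) share the same form: estimate plus or minus a t quantile times an estimated standard deviation. The sketch below assumes the t quantile is supplied by the caller (for example from a table); the name `coefficient_intervals` and the toy data are hypothetical.

```python
import math


def coefficient_intervals(x, y, t_quantile):
    """Return ((lo0, hi0), (lo1, hi1)): the intervals of Eqs. (15) and (16).

    t_quantile is the (1 + gamma)/2 quantile of Student's t with n - 2 d.f.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Estimated variances of the intercept and slope estimators.
    var_b0 = s2 * sum(xi * xi for xi in x) / (n * sxx)
    var_b1 = s2 / sxx
    h0 = t_quantile * math.sqrt(var_b0)
    h1 = t_quantile * math.sqrt(var_b1)
    return (b0 - h0, b0 + h0), (b1 - h1, b1 + h1)


# Hypothetical toy data with n = 4; 4.303 is the .975 quantile of Student's t
# with 2 d.f. as given in standard tables (gamma = .95).
beta0_ci, beta1_ci = coefficient_intervals(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0], 4.303
)
print(beta0_ci, beta1_ci)
```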


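To obtain a $\gamma$-level confidence interval on $\mu(x)$ for any $x$ in the domain $D$,
we note that

(i) $\mu(x) = \beta_0 + \beta_1 x$.

(ii) $\hat\mu(x) = \hat B_0 + \hat B_1 x$.

(iii) $\mathscr{E}[\hat\mu(x)] = \mu(x)$.

(iv) $$\mathrm{var}[\hat\mu(x)] = \mathrm{var}[\hat B_0 + \hat B_1 x] = \mathrm{var}[\hat B_0] + 2x\,\mathrm{cov}[\hat B_0, \hat B_1] + x^2\,\mathrm{var}[\hat B_1]$$
$$= \frac{\sigma^2}{\sum (x_i - \bar x)^2}\left(\frac{\sum x_i^2}{n} - 2x\bar x + x^2\right) = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar x)^2}{\sum (x_i - \bar x)^2}\right].$$

(v) $Z = [\hat\mu(x) - \mu(x)]/\sqrt{\mathrm{var}[\hat\mu(x)]}$ is distributed as a standard
normal random variable.

(vi) $U = (n-2)\hat\sigma^2/\sigma^2$ is distributed as a chi-square random variable
with $n-2$ d.f.

(vii) $U$ and $Z$ are independent.

(viii) $$T = \frac{\hat\mu(x) - \mu(x)}{\sqrt{\widehat{\mathrm{var}}[\hat\mu(x)]}} = \frac{\hat B_0 + \hat B_1 x - \beta_0 - \beta_1 x}{\sqrt{\hat\sigma^2[1/n + (x - \bar x)^2/\sum (x_i - \bar x)^2]}}$$
is distributed as Student's t distribution with $n-2$ d.f. Hence $T$ is a
pivotal quantity, and we obtain

$$P[-t_{(1+\gamma)/2}(n-2) \le T \le t_{(1+\gamma)/2}(n-2)] = \gamma.$$

If we substitute for $T$ and simplify, we get

$$P\left[\hat B_0 + \hat B_1 x - t_{(1+\gamma)/2}(n-2)\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x - \bar x)^2}{\sum (x_i - \bar x)^2}\right]} \le \beta_0 + \beta_1 x \le \hat B_0 + \hat B_1 x + t_{(1+\gamma)/2}(n-2)\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x - \bar x)^2}{\sum (x_i - \bar x)^2}\right]}\right] = \gamma$$

or

$$P\left[\hat B_0 + \hat B_1 x - t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat\mu(x)]} \le \beta_0 + \beta_1 x \le \hat B_0 + \hat B_1 x + t_{(1+\gamma)/2}(n-2)\sqrt{\widehat{\mathrm{var}}[\hat\mu(x)]}\right] = \gamma,$$

and a $100\gamma$ percent confidence interval for $\beta_0 + \beta_1 x$ is obtained.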
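The interval for the mean response is narrowest at the mean of the x values and widens as x moves away from it. A minimal sketch, assuming the t quantile is supplied by the caller; the name `mean_response_interval` and the data are hypothetical.

```python
import math


def mean_response_interval(x, y, x0, t_quantile):
    """Confidence interval for beta0 + beta1 * x0 in the simple linear model.

    t_quantile is the (1 + gamma)/2 quantile of Student's t with n - 2 d.f.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Estimated variance of the fitted mean at x0.
    est_var = s2 * (1.0 / n + (x0 - xbar) ** 2 / sxx)
    center = b0 + b1 * x0
    half = t_quantile * math.sqrt(est_var)
    return center - half, center + half


# Hypothetical toy data; 4.303 is the tabled .975 quantile of t with 2 d.f.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]
at_mean = mean_response_interval(xs, ys, 1.5, 4.303)
at_edge = mean_response_interval(xs, ys, 3.0, 4.303)
print(at_mean, at_edge)
```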


6 TESTS OF HYPOTHESES—CASE A 

In the linear model there are many tests that could be of interest to an
investigator. For example, he may want to test whether the line goes through the
origin, i.e., to test if the intercept is equal to zero, or perhaps test whether the
intercept is positive (or negative). These are indicated by

$\mathscr{H}_0: \beta_0 = 0$ versus $\mathscr{H}_1: \beta_0 \ne 0$,

$\mathscr{H}_0: \beta_0 \ge 0$ versus $\mathscr{H}_1: \beta_0 < 0$,

$\mathscr{H}_0: \beta_0 \le 0$ versus $\mathscr{H}_1: \beta_0 > 0$.

These tests indicate that there is no interest in the slope $\beta_1$ or the variance $\sigma^2$.
On the other hand the interest may be in the slope rather than the intercept, and
an investigator could be interested in testing

$\mathscr{H}_0: \beta_1 = 0$ versus $\mathscr{H}_1: \beta_1 \ne 0$,

$\mathscr{H}_0: \beta_1 \le 0$ versus $\mathscr{H}_1: \beta_1 > 0$,

etc. Rather than testing whether the intercept (or slope) is equal to 0 an
investigator may be interested in testing whether it is equal to a given number. For
example he may be interested in testing

$\mathscr{H}_0: \beta_0 = 2$ versus $\mathscr{H}_1: \beta_0 < 2$,

$\mathscr{H}_0: \beta_1 = 1$ versus $\mathscr{H}_1: \beta_1 \ne 1$,

etc. We shall derive a test of the hypothesis

$\mathscr{H}_0: \beta_1 = 0$ versus $\mathscr{H}_1: \beta_1 \ne 0$.

We could just as well derive a test of the hypothesis

$\mathscr{H}_0: \beta_1 \le 0$ versus $\mathscr{H}_1: \beta_1 > 0$

or a test of the hypothesis

$\mathscr{H}_0: \beta_1 \ge 0$ versus $\mathscr{H}_1: \beta_1 < 0$.

To test

$\mathscr{H}_0: \beta_1 = 0$ versus $\mathscr{H}_1: \beta_1 \ne 0$,

one obvious choice for a test statistic is

$$T = \frac{\hat B_1}{\sqrt{\widehat{\mathrm{var}}[\hat B_1]}}.$$

Under $\mathscr{H}_0$ the random variable $T$ is distributed as Student's t distribution with
$n-2$ degrees of freedom. Thus a test procedure with size $\alpha$ is the following:

Reject $\mathscr{H}_0$ if and only if $|T| > t_{1-\alpha/2}(n-2)$.

By comparing this with Eq. (16) we notice that this test is equivalent to the
procedure of setting a $1-\alpha$ confidence interval on the parameter $\beta_1$ and rejecting
the hypothesis if and only if the confidence interval does not contain 0.

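We will now show that this test is a generalized likelihood-ratio test.
Corresponding to the notation in Chap. IX we note that in testing

$\mathscr{H}_0: \beta_1 = 0$ versus $\mathscr{H}_1: \beta_1 \ne 0$

the parameter spaces $\Theta$, $\Theta_0$, and $\Theta_1$ are as given below, where $\theta = (\beta_0, \beta_1, \sigma^2)$:

$\Theta = \{(\beta_0, \beta_1, \sigma^2): -\infty < \beta_0 < \infty;\ -\infty < \beta_1 < \infty;\ \sigma^2 > 0\}$

$\Theta_0 = \{(\beta_0, \beta_1, \sigma^2): -\infty < \beta_0 < \infty;\ \beta_1 = 0;\ \sigma^2 > 0\}$

$\Theta_1 = \Theta - \Theta_0$.

We must determine $\lambda$, where

$$\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta; y_1, \ldots, y_n)}{\sup_{\theta \in \Theta} L(\theta; y_1, \ldots, y_n)}.$$  (17)

The likelihood function is

$$L(\theta; y_1, \ldots, y_n) = L(\beta_0, \beta_1, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2}\sum (y_i - \beta_0 - \beta_1 x_i)^2\right],$$  (18)

and the values of $\beta_0$, $\beta_1$, $\sigma^2$ that maximize this for $\theta \in \Theta$ are the
maximum-likelihood estimates given in Eqs. (7) to (9). Thus we get

$$\sup_{\theta \in \Theta} L(\theta; y_1, \ldots, y_n) = (2\pi\tilde\sigma^2)^{-n/2} \exp\left[-\frac{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{2\tilde\sigma^2}\right] = (2\pi\tilde\sigma^2)^{-n/2} e^{-n/2},$$

where $\tilde\sigma^2 = \frac{1}{n}\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$ is the maximum-likelihood estimate of $\sigma^2$
(divisor $n$, as opposed to the divisor $n-2$ of Eq. (13)). To find
$\sup_{\theta \in \Theta_0} L(\theta; y_1, \ldots, y_n)$, we substitute $\beta_1 = 0$ into Eq. (18) above and get

$$L(\beta_0, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2}\sum (y_i - \beta_0)^2\right].$$

But this is the likelihood function for a random sample of size $n$ from a normal
distribution with mean $\beta_0$ and variance $\sigma^2$. The values of $\beta_0$ and $\sigma^2$ that
maximize the likelihood function are the maximum-likelihood estimates

$$\beta_0^* = \bar y \quad\text{and}\quad \sigma^{*2} = \frac{1}{n}\sum (y_i - \bar y)^2.$$

Thus

$$\sup_{\theta \in \Theta_0} L(\theta; y_1, \ldots, y_n) = (2\pi\sigma^{*2})^{-n/2} \exp\left[-\frac{\sum (y_i - \bar y)^2}{2\sigma^{*2}}\right] = (2\pi\sigma^{*2})^{-n/2} e^{-n/2}.$$

We obtain

$$\lambda = \left(\frac{\tilde\sigma^2}{\sigma^{*2}}\right)^{n/2}$$

for the generalized likelihood-ratio. Instead of $\lambda$ we will examine the quantity
$(n-2)(\lambda^{-2/n} - 1)$, which is a monotonic function of $\lambda$ and hence will give an
equivalent test function. We get

$$\lambda^{-2/n} - 1 = \frac{\sigma^{*2} - \tilde\sigma^2}{\tilde\sigma^2} = \frac{\sum (y_i - \bar y)^2 - \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}.$$

Replace $\hat\beta_0$ with $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ in the numerator, and get

$$\lambda^{-2/n} - 1 = \frac{\sum (y_i - \bar y)^2 - \sum [(y_i - \bar y) - \hat\beta_1 (x_i - \bar x)]^2}{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}.$$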

Hence,

$$(n-2)(\lambda^{-2/n} - 1) = \frac{\hat\beta_1^2 \sum (x_i - \bar x)^2}{\hat\sigma^2} = \frac{\hat\beta_1^2 \sum (x_i - \bar x)^2 / \sigma^2}{\hat\sigma^2/\sigma^2},$$

which is the ratio of the values of two independent chi-square random variables
(under $\mathscr{H}_0: \beta_1 = 0$) divided by their respective degrees of freedom, which are
1 for the numerator and $n-2$ for the denominator. Thus $(n-2)(\lambda^{-2/n} - 1)$
has an F distribution with 1 and $n-2$ degrees of freedom under $\mathscr{H}_0$. The
generalized likelihood-ratio test says to reject $\mathscr{H}_0$ if and only if $\lambda \le \lambda_0$, or if
and only if

$$(n-2)(\lambda^{-2/n} - 1) \ge (n-2)(\lambda_0^{-2/n} - 1) = \lambda_0^* \text{ (say)},$$

or if and only if

$$\frac{\hat\beta_1^2 \sum (x_i - \bar x)^2}{\hat\sigma^2} \ge \lambda_0^*,$$

where $\lambda_0^*$ is chosen for a desirable size of Type I error.

Note that $(n-2)(\lambda^{-2/n} - 1)$ is the square of

$$\frac{\hat\beta_1}{\sqrt{\widehat{\mathrm{var}}[\hat B_1]}}$$

and recall that the square of a Student's t-distributed random variable with $n-2$
degrees of freedom has an F distribution with 1 and $n-2$ degrees of freedom.
Thus we have verified that if the confidence-interval statement in Eq. (16) is
used to test $\mathscr{H}_0: \beta_1 = 0$ versus $\mathscr{H}_1: \beta_1 \ne 0$, it is a generalized likelihood-ratio
test.
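The identity between the transformed likelihood ratio and the squared t statistic can be checked numerically. The sketch below computes both quantities from the residual sum of squares about the fitted line and the sum of squares about the sample mean; the name `glr_and_t` and the toy data are hypothetical.

```python
import math


def glr_and_t(x, y):
    """Return (lam, f_stat, t_stat) for testing beta1 = 0.

    lam    : generalized likelihood ratio, (rss / tss) ** (n / 2)
    f_stat : (n - 2) * (lam ** (-2 / n) - 1)
    t_stat : slope estimate divided by its estimated standard deviation
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))  # full model
    tss = sum((yi - ybar) ** 2 for yi in y)                      # beta1 = 0
    if rss == 0 or tss == 0:
        raise ValueError("degenerate data: a sum of squares is zero")
    # The ratio of the two variance estimates (each with divisor n) is rss/tss.
    lam = (rss / tss) ** (n / 2)
    f_stat = (n - 2) * (lam ** (-2 / n) - 1)
    t_stat = b1 / math.sqrt(rss / (n - 2) / sxx)
    return lam, f_stat, t_stat


# Hypothetical toy data (not from the text).
lam, f_stat, t_stat = glr_and_t([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0])
print(lam, f_stat, t_stat ** 2)
```

Because `f_stat` equals the square of `t_stat` up to rounding, rejecting for small values of the ratio is the same as rejecting for large absolute values of the t statistic.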

We will generalize this result slightly in the following theorem.

Theorem 3 In the linear model given in Definition 1 the generalized
likelihood-ratio test of size $\alpha$ of $\mathscr{H}_0: \beta_1 = b_1$ ($b_1$ is a given constant) versus
$\mathscr{H}_1: \beta_1 \ne b_1$ is given by the following: Use Eq. (16) to set a $1-\alpha$
confidence interval on $\beta_1$, and reject $\mathscr{H}_0$ if and only if the confidence interval
does not include $b_1$. ////

We shall state a theorem concerning a test of hypothesis on $\beta_0$, and the
proof will be asked for in Prob. 18.

Theorem 4 In the linear model given in Definition 1 the generalized
likelihood-ratio test of size $\alpha$ of $\mathscr{H}_0: \beta_0 = b_0$ ($b_0$ is a given constant) versus
$\mathscr{H}_1: \beta_0 \ne b_0$ is given by the following: Use Eq. (15) to set a $1-\alpha$
confidence interval on $\beta_0$, and reject $\mathscr{H}_0$ if and only if the confidence
interval does not include $b_0$. ////

There are many other tests that are of interest for the linear model and the 
interested reader can consult Refs. 17, 29, 31, and 32. 


7 POINT ESTIMATION—CASE B 

For this case $Y_1, Y_2, \ldots, Y_n$ are pairwise uncorrelated random variables with
means $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_2$, ..., $\beta_0 + \beta_1 x_n$ and variances $\sigma^2$. Since the joint
density of the $Y_i$ is not specified, maximum-likelihood estimators of $\beta_0$, $\beta_1$, and
$\sigma^2$ cannot be obtained. In models when the joint density of the observable
random variables is not given, a method of estimation called least-squares can
be utilized.

Definition 2 Least-squares Let $(Y_i, x_i)$, $i = 1, 2, \ldots, n$, be $n$ pairs of
observations that satisfy the linear model given in Definition 1. The
values of $\beta_0$ and $\beta_1$ that minimize the sum of squares

$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$$

are defined to be the least-squares estimators of $\beta_0$ and $\beta_1$. ////

To find the least-squares estimators of $\beta_0$ and $\beta_1$, we must find the values
that minimize

$$L(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2,$$

and clearly these are the same values that maximize the likelihood function in
Eq. (5). Hence we have the following theorem.


Theorem 5 In Case B of the simple linear model given in Definition 1 the
least-squares estimators of $\beta_0$ and $\beta_1$ are given by $\hat B_0$ and $\hat B_1$, where

$$\hat B_1 = \frac{\sum (Y_i - \bar Y)(x_i - \bar x)}{\sum (x_i - \bar x)^2}, \qquad \hat B_0 = \bar Y - \hat B_1 \bar x.$$  (19)

////

The least-squares method gives no estimator for $\sigma^2$, but an estimator of
$\sigma^2$ based on the least-squares estimators of $\beta_0$ and $\beta_1$ is

$$\hat\sigma^2 = \frac{1}{n-2}\sum (Y_i - \hat B_0 - \hat B_1 x_i)^2.$$

For Case A the maximum-likelihood estimators of $\beta_0$, $\beta_1$, and $\sigma^2$ had some
desirable optimum properties. The first corollary of Theorem 2 states that
$\hat B_0$ and $\hat B_1$ are uniformly minimum-variance unbiased estimators. That is, in the
class of all unbiased estimators of $\beta_0$ and $\beta_1$, the estimators $\hat B_0$ and $\hat B_1$ in Eq. (13)
have uniformly minimum variance. No such desirable property as this is
enjoyed by least-squares estimators for Case B. For Case A the assumptions
are much stronger than for Case B, where the distribution of the random
variables $Y_i$ is assumed to be unknown; so we should not expect as strong an
optimality in the estimators for Case B.

For Case B, we shall restrict our class of estimating functions and
determine if the least-squares estimators have any optimal properties in the restricted
class. Since $\mathscr{E}[Y_i] = \beta_0 + \beta_1 x_i$, we see that $\beta_0$ (and $\beta_1$) can be given by the
expected value of linear functions of the $Y_i$. Within this class of linear functions
we will define minimum-variance unbiased estimators.

Definition 3 Best linear unbiased estimators Let $Y_1, Y_2, \ldots, Y_n$ be
observable random variables such that $\mathscr{E}[Y_i] = \tau_i(\theta)$, where $\tau_i(\cdot)$ are
known functions that contain unknown parameters $\theta$ ($\theta$ may be
vector-valued). To estimate any $\theta_j$ in $\theta$, consider only the class of estimators
that are linear functions of the random variables $Y_i$. In this class consider
only the subclass of estimators that are unbiased for $\theta_j$. If in this
restricted class an estimator of $\theta_j$ exists which has smaller variance than
any other estimator of $\theta_j$ in this restricted class, it is defined to be the best
linear unbiased estimator of $\theta_j$ ("best" refers to minimum variance). ////

It should be noted that there are two restrictions on the estimating
functions before the property of minimum variance is considered. First, the class of
estimating functions is restricted to linear functions of the $Y_i$. Second, in
the class of linear functions of the $Y_i$ only unbiased estimators are considered.
Finally, then, consideration is given to finding a minimum-variance estimator in
the class of estimating functions that are linear and unbiased.

We will now prove an important theorem that gives optimum properties
for the point estimators of $\beta_0$ and $\beta_1$ derived by the method of least squares for
Case B. This theorem is often referred to as the Gauss-Markov theorem.

Theorem 6 Consider the linear model given in Definition 1, and let the
assumptions for Case B hold. Then the least-squares estimators for $\beta_1$
and $\beta_0$ given in Eq. (19) are the respective best linear unbiased estimators
for $\beta_1$ and $\beta_0$.

Proof We shall demonstrate the proof for $\beta_0$; the proof for $\beta_1$
is similar. Since we are restricting the class of estimators to be linear, we
have $\hat B_0 = \sum a_j Y_j$. We must determine the constants $a_j$ such that:

(i) $\mathscr{E}[\hat B_0] = \beta_0$; that is, $\hat B_0$ is an unbiased estimator of $\beta_0$.

(ii) $\mathrm{var}[\hat B_0]$ is a minimum among all estimators satisfying (i).

For (i) we must have

$$\beta_0 = \mathscr{E}[\hat B_0] = \sum a_j \mathscr{E}[Y_j] = \sum a_j(\beta_0 + \beta_1 x_j).$$

This gives the two equations which must be satisfied

$$\sum a_j = 1 \quad\text{and}\quad \sum a_j x_j = 0.$$  (20)

Now

$$\mathrm{var}[\hat B_0] = \mathscr{E}[(\hat B_0 - \beta_0)^2] = \mathscr{E}[(\sum a_j Y_j - \beta_0)^2]$$
$$= \mathscr{E}[[\sum a_j(\beta_0 + \beta_1 x_j + E_j) - \beta_0]^2]$$
$$= \mathscr{E}[[\beta_0 \sum a_j + \beta_1 \sum a_j x_j + \sum a_j E_j - \beta_0]^2].$$

By the restrictions of Eq. (20)

$$\mathrm{var}[\hat B_0] = \mathscr{E}[(\sum a_j E_j)^2] = \mathscr{E}\Big[\sum a_j^2 E_j^2 + \sum_{i \ne j} a_i a_j E_i E_j\Big].$$

The quantity $\mathscr{E}[E_i E_j]$ is 0 if $i \ne j$ since, by assumption, the $E_i$ are
uncorrelated and have means 0. Hence

$$\mathrm{var}[\hat B_0] = \sigma^2 \sum a_j^2.$$

Since $\sigma^2$ is a constant, to minimize $\mathrm{var}[\hat B_0]$ we need to minimize $\sum a_j^2$.
Thus constants $a_j$ must be found which minimize $\sum a_j^2$ subject to the
restrictions of Eq. (20). Using the theory of Lagrange multipliers, we must
minimize

$$L = \sum a_j^2 - \lambda_1(\sum a_j - 1) - \lambda_2 \sum a_j x_j.$$

Taking derivatives, one finds

$$\frac{\partial L}{\partial a_t} = 2a_t - \lambda_1 - \lambda_2 x_t = 0, \quad t = 1, 2, \ldots, n,$$  (21)

$$\frac{\partial L}{\partial \lambda_1} = 0, \qquad \frac{\partial L}{\partial \lambda_2} = 0.$$

If we sum over the first $n$ equations, we get (using $\sum a_t = 1$)

$$2 = n\lambda_1 + \lambda_2 \sum x_t.$$  (22)

If we multiply the $j$th equation in (21) by $x_j$ and add, we get

$$2\sum x_j a_j = \lambda_1 \sum x_j + \lambda_2 \sum x_j^2,$$

or since $\sum a_j x_j = 0$, this becomes

$$\lambda_1 = -\lambda_2 \frac{\sum x_i^2}{\sum x_i}.$$  (23)

If we substitute this into (22), we get

$$\lambda_2 = \frac{-2\sum x_i/n}{\sum x_i^2 - n\bar x^2} = \frac{-2\bar x}{\sum (x_i - \bar x)^2}$$

and

$$\lambda_1 = \frac{2\sum x_i^2/n}{\sum (x_i - \bar x)^2}.$$

Substituting $\lambda_1$ and $\lambda_2$ into the $t$th equation in (21) and solving for $a_t$
gives

$$a_t = \frac{(\sum x_i^2/n) - \bar x x_t}{\sum (x_i - \bar x)^2}.$$

The best linear unbiased estimator of $\beta_0$ is therefore

$$\sum a_t Y_t = \sum_t \frac{(\sum x_i^2/n) - \bar x x_t}{\sum (x_i - \bar x)^2}\, Y_t = \bar Y - \hat B_1 \bar x,$$

which is the one given by least squares, and so the proof is complete. A
similar proof holds for $\beta_1$. ////
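The weights obtained in the proof can be checked numerically: they should sum to 1, be orthogonal to the x values, and reproduce the least-squares intercept when applied to the observations. The following is a small sketch; the function name `blue_intercept_weights` and the data are hypothetical.

```python
def blue_intercept_weights(x):
    """Weights a_t with sum(a_t * Y_t) equal to the least-squares intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    mean_sq = sum(xi * xi for xi in x) / n
    # a_t = (sum(x_i^2)/n - xbar * x_t) / sum((x_i - xbar)^2)
    return [(mean_sq - xbar * xt) / sxx for xt in x]


# Hypothetical toy data (not from the text).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]
a = blue_intercept_weights(xs)

print(sum(a))                                  # restriction: weights sum to 1
print(sum(at * xt for at, xt in zip(a, xs)))   # restriction: sum a_t x_t = 0
print(sum(at * yt for at, yt in zip(a, ys)))   # the least-squares intercept
```

The weights depend only on the x values, which is why the estimator is linear in the observations.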


PROBLEMS 

1 Assume that the data below satisfy the simple linear model given in Definition 1 
for Case A. 

y: -6.1 -0.5 7.2 6.9 -0.2 -2.1 -3.9 3.8 

x: -2.0 0.6 1.4 1.3 0.0 -1.6 -1.7 0.7 

Find the maximum-likelihood estimates of $\beta_0$, $\beta_1$, and $\sigma^2$.

2 In Prob. 1 find the UMVUE of $\beta_0 + 3\beta_1$.

3 In Prob. 1 find a 95 percent confidence interval on $\beta_0$; on $\beta_1$; on $\sigma^2$.

4 In Prob. 1 find a 90 percent confidence interval on $\mu(x)$ for $x = -1.0$.

5 In the simple linear model for Case A find the maximum-likelihood estimator of $\theta$,
where $\theta = \beta_0 + 3\beta_1 + 2\sigma^2$.

6 In Prob. 5 find the UMVUE of $\theta$.

7 In the simple linear model for Case A, show that $p$ proportion of the distribution
of $Y$ at $x = x_0$ is below $\xi_p$, where $\xi_p = \beta_0 + \beta_1 x_0 + z_p \sigma$ and $z_p$ is given by $\Phi(z_p) = p$.

8 In Prob. 7 find the UMVUE of $\xi_p$.

9 Use the data in Prob. 1 to evaluate the UMVUE of $\xi_p$ in Prob. 7.

10 The hardness Y of the shells of eggs laid by a certain breed of chickens was as¬ 
sumed to be roughly linearly related to the amount x of a certain food supplement 
put into the diet of the chickens. The model was assumed to be a simple linear 
model for Case A. Data were collected and are given below:

$y_i$: .70 .98 1.16 1.75 .76 .82 .95 1.24 1.75 1.95

$x_i$: .12 .21 .34 .61 .13 .17 .21 .34 .62 .71

Test the hypothesis that $\beta_1 = 1.00$ versus the hypothesis $\beta_1 \ne 1.00$. Use a Type I
error probability of 5 percent.

11 In Prob. 10 test the hypothesis $\beta_1 > 1$ versus the hypothesis $\beta_1 \le 1$.

12 In Prob. 10 test the hypothesis $\mu(.50) > 1.5$ versus the hypothesis $\mu(.50) \le 1.5$.
Use a Type I error probability of 10 percent.

13 In Prob. 10 compute a 90 percent confidence interval on $2\sigma$.

14 In the simple linear model for Case A find the UMVUE of $\beta_1/\sigma^2$.

15 Consider the simple linear model given in Definition 1 except var [Y,] = aT 1 o 2 , 
where $a_i$, $i = 1, 2, \ldots, n$, are known positive numbers. Find the
maximum-likelihood estimators of $\beta_0$ and $\beta_1$.

16 What are the conditions on the $x_i$ in the simple linear model for Case A so that
$\hat B_0$ and $\hat B_1$ are independent?

17 In the simple linear model for Case A show that $\bar Y$ and $\hat B_1$ are uncorrelated. Are
they independent?

18 Prove Theorem 4.

19 In Theorem 6 give the proof for the best linear unbiased estimator of $\beta_1$.

20 For the simple linear model for Case B prove that the best (minimum-variance)
linear unbiased estimator of $\beta_0 + \beta_1$ is $\hat B_0 + \hat B_1$, where $\hat B_0$ and $\hat B_1$ are the
least-squares estimators of $\beta_0$ and $\beta_1$, respectively.

21 Extend Prob. 20 to $c_0\beta_0 + c_1\beta_1$, where $c_0$ and $c_1$ are given constants.



XI 

NONPARAMETRIC METHODS 


1 INTRODUCTION AND SUMMARY 

The important place ascribed to the normal distribution in statistical theory is 
well justified on the basis of the central-limit theorem. However, often it is 
not known whether the basic distribution is such that the central-limit theorem 
applies or whether the approximation to the normal distribution is good enough 
that the resulting confidence intervals and tests of hypotheses based on normal 
theory are as accurate as desired. For example, if a random sample of size n 
is taken from a population with a normal density and a .95 confidence interval 
is set about the mean (see Sec. 3.1 of Chap. VIII) then the frequency interpretation
is the following: If repeated random samples are taken from this population
and if a 95 percent confidence interval is obtained for each random sample,
in the long run 95 percent of these intervals will contain the mean of the density. 
If sampling is from a density that is not normal, then, instead of 95 percent of 
the intervals containing the mean, it may be 99 or 90 percent, or some other 
percentage. If it is close to 95 percent, say 93 to 97 percent, usually the experimenter
will be satisfied. However, if it deviates a large amount from the desired
percentage, then the experimenter will probably not be satisfied. In cases where
it is known that the conventional methods based on the assumption of a normal 
density are not applicable, an alternative method is desired. If the basic distribution
is known (but is not necessarily normal), one may be able to derive exact
(or sufficiently accurate) tests of hypotheses and confidence intervals based on 
that distribution. In many cases an experimenter does not know the form of the 
basic distribution and needs statistical techniques which are applicable regardless 
of the form of the density. These techniques are called nonparametric or 
distribution-free methods. 

The term "nonparametric" arises from considerations of testing hypotheses
(Chap. IX). In forming the generalized likelihood-ratio, for example, one
deals with a parameter space which defines a family of distributions as the parameters
in the functional form of the distribution vary over the parameter space.
The methods to be developed in this chapter make no use of functional forms 
or parameters of such forms. They apply to very wide families of distributions 
rather than only to families specified by a particular functional form. The term 
“ distribution-free ” is also often used to indicate similarly that the methods do 
not depend on the functional form of distribution functions. 

The nonparametric methods that will be considered will, for the most part, 
be based on the order statistics. Also, although the methods to be presented are 
applicable to both continuous and discrete random variables, we shall direct our 
attention almost entirely to the continuous case. 

Section 2 will be devoted to considerations of statistical inferences that 
concern the cumulative distribution function of the population to be sampled. 
The sample cumulative distribution function will be used in three types of 
inference, namely, point estimation, interval estimation, and testing. Population
quantiles have been defined for any distribution function regardless of the
form of that distribution. Section 3 deals with distribution-free statistical 
methods of making inferences regarding population quantiles. Section 4 studies 
an important concept, that of tolerance limits. The similarities and differences 
of tolerance limits and confidence limits are noted. 

In Sec. 5 we return to an important problem in the application of the 
theory of statistics. It is the problem of testing the homogeneity of two populations.
This problem was first mentioned in Subsec. 4.3 of Chap. IX when we
tested the equality of the means of two normal populations. It was considered 
again in Subsec. 5.3 of Chap. IX when we tested the equality of two multinomial
populations. We indicated there that the derived test using a chi-square-type
statistic could be used to test the equality of two arbitrary populations, and
so we had really anticipated this chapter inasmuch as we derived a distribution-free
test. Other distribution-free tests of the homogeneity of two populations
will be presented in Sec. 5. Included will be the sign test, the run test, the
median test, and the rank-sum test. 

In this chapter we present only a very brief introduction to nonparametric 
statistical methods. This chapter is similar to the last inasmuch as it includes 
use of the three basic kinds of inference that were the focus of our attention in 
Chaps. VII to IX. We shall see that much of the required distributional theory 
is elementary, seldom using anything more complicated than the basic principles 
of probability that were considered in Chap. I and the binomial distribution. 


2 INFERENCES CONCERNING A 

CUMULATIVE DISTRIBUTION FUNCTION 

2.1 Sample or Empirical Cumulative Distribution Function 

In Subsec. 5.4 of Chap. VI, we defined the sample cumulative distribution 
function (c.d.f.). We indicated there that it could be used to estimate the cumulative
distribution function from which we sampled. In this subsection some
results about the sample c.d.f. will be reviewed and used to formulate point 
estimates. In the two following subsections the sample c.d.f. will be utilized to 
test a hypothesis (in Subsec. 2.2) and to set a confidence interval (in Subsec. 2.3). 
Recall that (see Definition 13 in Chap. VI) the sample c.d.f. is defined by

$$F_n(x) = \frac{1}{n}(\text{number of } X_i \text{ less than or equal to } x) = \frac{1}{n}\sum_{j=1}^{n} I_{(-\infty, x]}(X_j),$$  (1)

where $X_1, \ldots, X_n$ is a random sample from some c.d.f. $F(\cdot)$. According to
Theorem 17 of Chap. VI,

$$P\left[F_n(x) = \frac{k}{n}\right] = \binom{n}{k}[F(x)]^k[1 - F(x)]^{n-k}, \quad k = 0, 1, \ldots, n,$$  (2)

where $F_n(\cdot)$ is the sample c.d.f. corresponding to c.d.f. $F(\cdot)$. From Eq. (2),
we see that

$$\mathscr{E}[F_n(x)] = \sum_{k=0}^{n} \frac{k}{n}\binom{n}{k}[F(x)]^k[1 - F(x)]^{n-k} = F(x)$$  (3)

and similarly

$$\mathrm{var}[F_n(x)] = \frac{1}{n}F(x)[1 - F(x)].$$  (4)

In fact, since $F_n(x)$ is the sample mean of random variables $I_{(-\infty, x]}(X_1)$,
..., $I_{(-\infty, x]}(X_n)$, we know by the central-limit theorem that $F_n(x)$ is asymptotically
normally distributed with mean $F(x)$ and variance $(1/n)F(x)[1 - F(x)]$.
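Equation (1) translates directly into code: the sample c.d.f. at x is the fraction of observations not exceeding x. The following is a minimal sketch; the name `make_ecdf` and the sample are hypothetical.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the sample c.d.f. F_n of Eq. (1) as a function of x."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def ecdf(x):
        # bisect_right counts the observations less than or equal to x.
        return bisect_right(xs, x) / n

    return ecdf


# Hypothetical sample with a repeated value.
F_n = make_ecdf([3, 1, 2, 2])
print(F_n(0), F_n(1), F_n(2), F_n(2.5), F_n(3))
```

Sorting once and using binary search keeps each evaluation cheap, which matters when the function is evaluated at many points.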

Equations (3) and (4) show that for fixed $x$, $F_n(x)$ is an unbiased and
mean-squared-error consistent estimator of $F(x)$, regardless of the form of $F(\cdot)$.
If one is interested in estimating $F(x)$ for every $x$ (rather than for a fixed $x$),
then one is interested in saying something about how close $F_n(x)$ is to $F(x)$
jointly over all values $x$; hence the following result is of interest:

$$P\Big[\sup_{-\infty < x < \infty} |F_n(x) - F(x)| \to 0 \text{ as } n \to \infty\Big] = 1.$$  (5)

Equation (5), known as the Glivenko-Cantelli theorem, states that with
probability one the convergence of $F_n(x)$ to $F(x)$ is uniform in $x$. We can define

$$D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|.$$  (6)

$D_n$ is a random quantity that measures how far $F_n(\cdot)$ deviates from $F(\cdot)$.
Equation (5) states that $P[\lim_{n \to \infty} D_n = 0] = 1$; so, in particular, the c.d.f. of $D_n$,
say $F_{D_n}(\cdot)$, converges to the discrete c.d.f. that has all its mass at 0. In the next
subsection we will consider the limiting distribution of $\sqrt{n}\,D_n$. Equation (5)
tells us that the estimating function $F_n(x)$ of the c.d.f. $F(x)$ converges to $F(x)$
uniformly for all $x$ with probability one.

Instead of a point estimate of $F(x) = P[X \le x]$, one might be interested
in a point estimate of $F(y) - F(x) = P[x < X \le y]$ for fixed $x < y$. The following
remark is useful in showing that $F_n(y) - F_n(x)$ is an unbiased mean-squared-error
consistent estimator of $F(y) - F(x)$.

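Remark

$$\mathrm{cov}[F_n(x), F_n(y)] = \frac{1}{n}F(x)[1 - F(y)] \quad\text{for } y > x.$$  (7)

PROOF

$$\mathrm{cov}[F_n(x), F_n(y)] = \mathrm{cov}\Big[\frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, x]}(X_i),\ \frac{1}{n}\sum_{j=1}^{n} I_{(-\infty, y]}(X_j)\Big]$$
$$= \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{cov}[I_{(-\infty, x]}(X_i),\ I_{(-\infty, y]}(X_j)]$$
$$= \frac{1}{n}\,\mathrm{cov}[I_{(-\infty, x]}(X_1),\ I_{(-\infty, y]}(X_1)]$$
$$= \frac{1}{n}\{\mathscr{E}[I_{(-\infty, x]}(X_1) I_{(-\infty, y]}(X_1)] - \mathscr{E}[I_{(-\infty, x]}(X_1)]\,\mathscr{E}[I_{(-\infty, y]}(X_1)]\}$$
$$= \frac{1}{n}[F(x) - F(x)F(y)]$$
$$= \frac{1}{n}F(x)[1 - F(y)].$$ ////

Using Eq. (7), one sees immediately that

$$\mathrm{var}[F_n(y) - F_n(x)] = \mathrm{var}[F_n(y)] - 2\,\mathrm{cov}[F_n(x), F_n(y)] + \mathrm{var}[F_n(x)]$$
$$= \frac{1}{n}[F(y) - F(x)][1 - F(y) + F(x)];$$

mean-squared-error consistency of $F_n(y) - F_n(x)$ as an estimator of $F(y) - F(x)$
follows immediately.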

Rather than estimating $P[x < X \le y]$, i.e., the probability that $X$ falls
in some interval, one might consider estimating $P[X \in B]$, i.e., the probability
that $X$ falls in some set $B$. It can be shown that (see the Problems) $(1/n)\sum_{i=1}^{n} I_B(X_i)$
is an unbiased estimator of $P[X \in B]$, and

$$\mathrm{var}\Big[\frac{1}{n}\sum_{i=1}^{n} I_B(X_i)\Big] = \frac{1}{n}P[X \in B](1 - P[X \in B]),$$

hence, is mean-squared-error consistent.


2.2 Kolmogorov-Smirnov Goodness-of-fit Test 

We noted above that $F_n(x)$ has an asymptotic normal distribution. Equivalently,
$\sqrt{n}[F_n(x) - F(x)]$ has a limiting normal distribution with mean 0 and variance
$F(x)[1 - F(x)]$. We now state (without proof) a result that gives the limiting
distribution of

$$\sqrt{n}\,D_n = \sqrt{n}\sup_{-\infty < x < \infty} |F_n(x) - F(x)|.$$

Theorem 1 Let $X_1, \ldots, X_n, \ldots$ be independent identically distributed
random variables having common continuous c.d.f. $F_X(\cdot) = F(\cdot)$.
Define

$$D_n = d_n(X_1, \ldots, X_n) = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|,$$

where $F_n(x)$ is the sample c.d.f. Then

$$\lim_{n \to \infty} F_{\sqrt{n} D_n}(x) = \lim_{n \to \infty} P[\sqrt{n}\,D_n \le x] = \Big[1 - 2\sum_{j=1}^{\infty} (-1)^{j-1} e^{-2j^2x^2}\Big] I_{(0, \infty)}(x) = H(x), \text{ say}.$$  (8) ////
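The limiting c.d.f. of Eq. (8) is an alternating series that converges very quickly for moderate x. The sketch below sums a fixed number of terms; the name `kolmogorov_cdf`, the number of terms, and the cutoff below 0.2 are assumptions made for this illustration. Near 0 the limiting c.d.f. is far smaller than ordinary rounding error while the alternating sum becomes numerically unreliable, so the sketch returns 0 there.

```python
import math


def kolmogorov_cdf(x, terms=100):
    """Approximate H(x) = 1 - 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 x^2)."""
    if x < 0.2:
        # Assumption of this sketch: H(x) is negligibly small here, and the
        # alternating series cancels badly, so return 0 instead of summing.
        return 0.0
    total = 0.0
    for j in range(1, terms + 1):
        total += (-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
    return 1.0 - 2.0 * total


# Compare with the tabled pairs of gamma and k_gamma quoted in this chapter.
for k in (1.07, 1.14, 1.22, 1.36, 1.63):
    print(k, round(kolmogorov_cdf(k), 3))
```

For the values of x used in testing, the first one or two terms already determine the result to several decimal places.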

The c.d.f. given in Eq. (8) does not depend on the c.d.f. from which the
sample was drawn (other than that it be continuous); that is, the limiting
distribution of $\sqrt{n}\,D_n$ is distribution-free. This fact allows $D_n$ to be broadly
used as a test statistic for goodness of fit. For instance, suppose one wishes to
test that the distribution that is being sampled from is some specified continuous
distribution; that is, test $\mathscr{H}_0: X_i \sim F_0(\cdot)$, where $F_0(\cdot)$ is some completely
specified continuous c.d.f. If $\mathscr{H}_0$ is true,

$$K_n = k_n(X_1, \ldots, X_n) = \sqrt{n}\sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$$  (9)

is approximately distributed as $H(\cdot)$, the c.d.f. given in Eq. (8). If $\mathscr{H}_0$ is false,
then $F_n(\cdot)$ will tend to be near the true c.d.f. $F(\cdot)$ and not near $F_0(\cdot)$, and
consequently $\sup_x |F_n(x) - F_0(x)|$ will tend to be large; hence a reasonable
test criterion is to reject $\mathscr{H}_0$ if $\sup_x |F_n(x) - F_0(x)|$ is large. Since
$K_n = \sqrt{n}\sup_x |F_n(x) - F_0(x)|$ is approximately distributed as $H(\cdot)$ when
$\mathscr{H}_0$ is true and $H(\cdot)$ has been tabulated, $k_{1-\alpha}$ can be determined so that
$1 - H(k_{1-\alpha}) = \alpha$, and hence $P[K_n > k_{1-\alpha}] \approx \alpha$. That is, the test defined by
"Reject $\mathscr{H}_0$ if and only if $K_n > k_{1-\alpha}$" has approximate size $\alpha$. Such a test is
often labeled the Kolmogorov-Smirnov goodness-of-fit test. It tests how well a
given set of observations fits some specified c.d.f. $F_0(\cdot)$. The fit is measured
by the so-called Kolmogorov statistic $\sup_x |F_n(x) - F_0(x)|$. Theorem 1 gives
an asymptotic distribution for $D_n$. The exact distribution of $D_n$ has been
tabled for various $n$. See Ref. 44.


EXAMPLE 1 A question of at least curious interest is the following: Are 
the times of birth uniformly distributed over the hours of the day? For 
37 consecutive births (actual data) the following times were observed: 
7:02 p.m., 11:08 p.m., 3:56 a.m., 8:12 a.m., 8:40 a.m., 12:25 p.m., 1:24 a.m., 
8:25 a.m., 2:02 p.m., 11:46 p.m., 10:07 a.m., 1:53 p.m., 6:45 p.m., 9:06 a.m., 
3:57 p.m., 7:40 a.m., 3:02 a.m., 10:45 a.m., 3:06 p.m., 6:26 a.m., 4:44 p.m., 
12:26 a.m., 2:17 p.m., 11:45 p.m., 5:08 a.m., 5:49 a.m., 6:32 a.m., 12:40 p.m., 
1:30 p.m., 12:55 p.m., 3:22 p.m., 4:09 p.m., 7:46 p.m., 2:28 a.m., 10:06 a.m., 
11:19 a.m., 4:31 p.m. Both the hypothesized uniform c.d.f. and the sample 
c.d.f. are sketched in Fig. 1.
One can calculate $\sqrt{n}\sup_x |F_n(x) - F(x)| = \sqrt{37}\,|31/37 - 1004/1440| \approx .85$.
The critical value for size $\alpha = .10$ is greater than 1.22; so, according
to the Kolmogorov-Smirnov goodness-of-fit test, the data do not indicate
that the hypothesis that times of birth are uniformly distributed throughout
the hours of the day should be rejected. ////
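The calculation in Example 1 can be reproduced from the 37 listed birth times. The sketch below converts each time to minutes after midnight and evaluates the Kolmogorov statistic against the uniform c.d.f. on a 1440-minute day; the helper names are hypothetical, and the supremum is found by checking both sides of each jump of the sample c.d.f.

```python
import math

# The 37 birth times listed in Example 1.
BIRTH_TIMES = [
    "7:02 p.m.", "11:08 p.m.", "3:56 a.m.", "8:12 a.m.", "8:40 a.m.",
    "12:25 p.m.", "1:24 a.m.", "8:25 a.m.", "2:02 p.m.", "11:46 p.m.",
    "10:07 a.m.", "1:53 p.m.", "6:45 p.m.", "9:06 a.m.", "3:57 p.m.",
    "7:40 a.m.", "3:02 a.m.", "10:45 a.m.", "3:06 p.m.", "6:26 a.m.",
    "4:44 p.m.", "12:26 a.m.", "2:17 p.m.", "11:45 p.m.", "5:08 a.m.",
    "5:49 a.m.", "6:32 a.m.", "12:40 p.m.", "1:30 p.m.", "12:55 p.m.",
    "3:22 p.m.", "4:09 p.m.", "7:46 p.m.", "2:28 a.m.", "10:06 a.m.",
    "11:19 a.m.", "4:31 p.m.",
]


def minutes_after_midnight(label):
    """Convert a label such as '7:02 p.m.' to minutes after midnight."""
    clock, meridiem = label.split()
    hours, minutes = (int(part) for part in clock.split(":"))
    hours %= 12                 # 12:xx a.m. becomes 0:xx
    if meridiem == "p.m.":
        hours += 12             # 12:xx p.m. becomes 12:xx, 7:xx p.m. 19:xx
    return 60 * hours + minutes


def ks_statistic_uniform(values, upper):
    """sqrt(n) * sup |F_n(x) - x/upper| for the uniform c.d.f. on (0, upper)."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = x / upper
        # The sample c.d.f. jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return math.sqrt(n) * d


k_n = ks_statistic_uniform([minutes_after_midnight(t) for t in BIRTH_TIMES], 1440)
print(round(k_n, 2))
```

The statistic falls well below the tabled value 1.22, in line with the conclusion of the example.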

The Kolmogorov-Smirnov goodness-of-fit test assumed that the null 
hypothesis was simple; that is, the null hypothesis completely specified (no 
unknown parameters) the distribution of the population. One might inquire 
as to whether such a goodness-of-fit testing procedure can be extended to a 
composite null hypothesis which states that the distribution of the population 
belongs to some parametric family of distributions, say $\{F(\cdot\,; \theta): \theta \in \Theta\}$. For
such null hypotheses, $\sup_x |F_n(x) - F(x; \theta)|$ is no longer a statistic since it depends
on an unknown parameter $\theta$. An obvious way of removing the dependence on
$\theta$ is to replace $\theta$ by an estimator, say $\hat\theta$, similar to what was done in the classical
chi-square goodness-of-fit test. The test statistic then becomes
$\sup_x |F_n(x) - F(x; \hat\theta)|$. The distribution of such a test statistic is not known and, in general,
depends on the hypothesized parametric family. Although some studies (often 
Monte Carlo) have been reported in the literature, much remains to be done 
before a Kolmogorov-Smirnov goodness-of-fit test for composite hypotheses 
becomes a practical testing tool.

2.3 Confidence Bands for Cumulative Distribution Function

Theorem 1 can also be used to set confidence bands on the c.d.f. $F(\cdot)$ sampled
from. Let $k_\gamma$ be defined by $H(k_\gamma) = \gamma$, where $H(\cdot)$ is the c.d.f. in Eq. (8). A
brief table of $\gamma$ and $k_\gamma$ is

    $\gamma$      .99    .95    .90    .85    .80
    $k_\gamma$    1.63   1.36   1.22   1.14   1.07

It follows that

    $P[\sqrt{n}\,\sup_x |F_n(x) - F(x)| \le k_\gamma] \approx \gamma,$

but

    $P[\sqrt{n}\,\sup_x |F_n(x) - F(x)| \le k_\gamma] = P[\sup_x |F_n(x) - F(x)| \le k_\gamma/\sqrt{n}]$
        $= P[F_n(x) - k_\gamma/\sqrt{n} \le F(x) \le F_n(x) + k_\gamma/\sqrt{n}$ for all $x]$,

noting that

    $\sup_x |F_n(x) - F(x)| \le k_\gamma/\sqrt{n}$

if and only if

    $F_n(x) - k_\gamma/\sqrt{n} \le F(x) \le F_n(x) + k_\gamma/\sqrt{n}$

for all x. Using the fact that $0 \le F(x) \le 1$, we have

    $P[\max[0, F_n(x) - k_\gamma/\sqrt{n}] \le F(x) \le \min[F_n(x) + k_\gamma/\sqrt{n}, 1]$ for all $x] \approx \gamma$;    (10)

that is, the band with lower boundary defined by $L(x) = \max[0, F_n(x) - k_\gamma/\sqrt{n}]$
and upper boundary defined by $U(x) = \min[F_n(x) + k_\gamma/\sqrt{n}, 1]$ is an approximate
$100\gamma$ percent confidence band for the c.d.f. $F(\cdot)$, where the meaning of the
confidence band is given in Eq. (10).
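The band of Eq. (10) is simple to evaluate at the observed points. The following Python sketch assumes no tied observations and uses only the asymptotic constants from the brief table above; the function name `confidence_band` and the demonstration data are hypothetical.

```python
import math

# Asymptotic constants k_gamma from the brief table in the text.
K_GAMMA = {0.99: 1.63, 0.95: 1.36, 0.90: 1.22, 0.85: 1.14, 0.80: 1.07}


def confidence_band(sample, gamma=0.90):
    """Return (x, L(x), U(x)) at each ordered observation x."""
    xs = sorted(sample)
    n = len(xs)
    half_width = K_GAMMA[gamma] / math.sqrt(n)
    band = []
    for i, x in enumerate(xs, start=1):
        f_n = i / n  # sample c.d.f. evaluated at x (no ties assumed)
        lower = max(0.0, f_n - half_width)
        upper = min(f_n + half_width, 1.0)
        band.append((x, lower, upper))
    return band


demo = [0.12, 0.35, 0.41, 0.58, 0.77, 0.93]  # hypothetical data
for x, lower, upper in confidence_band(demo, gamma=0.90):
    print(x, round(lower, 3), round(upper, 3))
```

With so small a sample the band is very wide, which reflects the fact that the half-width $k_\gamma/\sqrt{n}$ shrinks only at rate $1/\sqrt{n}$.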
3 INFERENCES CONCERNING QUANTILES


3.1 Point and Interval Estimates of a Quantile

Throughout this section, we will assume that we are sampling from a continuous
c.d.f., say $F(\cdot)$. Recall (see Definition 17 of Chap. II) that the qth quantile of
c.d.f. $F(\cdot)$, denoted by $\xi_q$, is defined by $F(\xi_q) = q$ for fixed q, $0 < q < 1$. In
particular, for $q = 1/2$, $\xi_{1/2}$ is called the median. We saw in Subsec. 4.6 of Chap. II
that quantiles can be used to measure location and dispersion of a c.d.f. For
instance, $\xi_{1/2}$, $(\xi_q + \xi_{1-q})/2$, etc., are measures of location, and $\xi_{.9} - \xi_{.1}$,
$\xi_{.75} - \xi_{.25}$, etc., are measures of dispersion.

In Subsec. 2.1, we considered estimating $F(x)$ for fixed x; now, we consider
estimating $\xi_q$ such that $F(\xi_q) = q$ for fixed q. We know that if X is a
continuous random variable with c.d.f. $F(\cdot)$, then the random variable $F(X)$
has a uniform distribution (see Theorem 12 in Chap. V) over the interval (0, 1).
Hence $F(Y_j)$ has the same distribution as the jth order statistic from a uniform
distribution, and we know that $\mathscr{E}[F(Y_j)] = j/(n + 1)$. (As usual, $Y_1, \ldots, Y_n$
are the order statistics corresponding to the random sample $X_1, \ldots, X_n$.)
Consequently, we might estimate $\xi_q$ with $Y_j$ if $q \approx j/(n + 1)$. [If $j/(n + 1) < q <
(j + 1)/(n + 1)$, one could estimate $\xi_q$ by interpolating between the order statistics
$Y_j$ and $Y_{j+1}$.]
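The point estimate just described can be sketched in a few lines of Python. The function name `quantile_estimate` is hypothetical, linear interpolation is one of several reasonable interpolation rules, and the fallback to $Y_1$ or $Y_n$ when q lies outside $[1/(n+1), n/(n+1)]$ is an assumption made here, since the text does not treat that case.

```python
def quantile_estimate(sample, q):
    """Estimate xi_q by interpolating between order statistics.

    Uses E[F(Y_j)] = j/(n+1): find j with j/(n+1) <= q <= (j+1)/(n+1)
    and interpolate linearly between Y_j and Y_{j+1}.
    """
    ys = sorted(sample)
    n = len(ys)
    pos = q * (n + 1)  # fractional (1-based) index of the order statistic
    if pos <= 1:
        return ys[0]   # assumption: fall back to Y_1 for very small q
    if pos >= n:
        return ys[-1]  # assumption: fall back to Y_n for very large q
    j = int(pos)       # integer part, so 1 <= j <= n - 1
    frac = pos - j
    return ys[j - 1] + frac * (ys[j] - ys[j - 1])


data = [9, 3, 7, 1, 5, 8, 2, 6, 4]  # hypothetical sample, n = 9
print(quantile_estimate(data, 0.5), quantile_estimate(data, 0.25))
```

For n = 9 the median estimate is $Y_5$ because $5/(9 + 1) = .5$ exactly, while q = .25 falls midway between $2/10$ and $3/10$ and so is estimated midway between $Y_2$ and $Y_3$.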

A confidence-interval estimate of $\xi_q$ can be obtained by using two order
statistics, the interval between them constituting the confidence interval. We
are interested in computing the confidence coefficient for a pair of order statistics.

    $P[Y_j \le \xi_q \le Y_k] = P[F(Y_j) \le F(\xi_q) = q \le F(Y_k)]$
        $= 1 - P[F(Y_j) > q] - P[F(Y_k) < q]$
        $= P[F(Y_j) \le q] - P[F(Y_k) < q].$

Recall that

    $f_{Y_j}(y) = \frac{n!}{(j-1)!\,(n-j)!}\,[F(y)]^{j-1}[1 - F(y)]^{n-j} f(y);$

hence for $Z = F(Y_j)$,

    $dz/dy = f(y),$

and so,

    $f_Z(z) = \left|\frac{dy}{dz}\right| f_{Y_j}(y) = \frac{n!}{(j-1)!\,(n-j)!}\, z^{j-1}(1 - z)^{n-j}$    for $0 < z < 1$.

Thus,

    $P[F(Y_j) \le u] = \int_0^u f_Z(z)\,dz$
        $= \frac{n!}{(j-1)!\,(n-j)!} \int_0^u z^{j-1}(1 - z)^{n-j+1-1}\,dz$
        $= IB_u(j, n - j + 1),$

called the incomplete beta function, which is extensively tabulated. Hence,

    $P[Y_j \le \xi_q \le Y_k] = IB_q(j, n - j + 1) - IB_q(k, n - k + 1),$

which is the confidence coefficient of the interval $(Y_j, Y_k)$. In practice, of course,
we are interested in going in the other direction; that is, for fixed $\gamma$ pick j and k
(and consequently order statistics $Y_j$ and $Y_k$) such that

    $IB_q(j, n - j + 1) - IB_q(k, n - k + 1) = \gamma,$

and then $(Y_j, Y_k)$ is a $100\gamma$ percent confidence interval for $\xi_q$. Of course, for
arbitrary $\gamma$ there will not exist a j and k so that the confidence coefficient is
exactly $\gamma$.

The confidence coefficient can be obtained another way.

    $P[Y_j \le \xi_q \le Y_k] = P[Y_j \le \xi_q] - P[Y_k < \xi_q].$

But

    $P[Y_j \le \xi_q] = P[j\text{th order statistic} \le \xi_q]$
        $= P[j \text{ or more observations} \le \xi_q]$
        $= \sum_{i=j}^{n} P[\text{exactly } i \text{ observations} \le \xi_q]$
        $= \sum_{i=j}^{n} \binom{n}{i} [F(\xi_q)]^i [1 - F(\xi_q)]^{n-i};$

hence,

    $P[Y_j \le \xi_q \le Y_k] = \sum_{i=j}^{n} \binom{n}{i} q^i (1 - q)^{n-i} - \sum_{i=k}^{n} \binom{n}{i} q^i (1 - q)^{n-i}$
        $= \sum_{i=j}^{k-1} \binom{n}{i} q^i (1 - q)^{n-i}.$

Note that a table of the binomial distribution can now be used to evaluate the
confidence coefficient.

EXAMPLE 2 For a sample of size 10, what is the confidence coefficient of
the interval $(Y_2, Y_9)$, which is a confidence-interval estimator of the
population median? We have

    $P[Y_2 \le \xi_{.50} \le Y_9] = \sum_{i=2}^{8} \binom{10}{i} (1/2)^{10} = 1002/1024 \approx .9785.$

////
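The binomial form of the confidence coefficient is easy to evaluate numerically. The Python sketch below (function name `confidence_coefficient` is hypothetical) sums the binomial terms for $i = j, \ldots, k - 1$ and applies the result to the setting of Example 2.

```python
from math import comb


def confidence_coefficient(n, j, k, q):
    """P[Y_j <= xi_q <= Y_k] = sum_{i=j}^{k-1} C(n, i) q^i (1 - q)^(n - i)."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(j, k))


# Example 2: n = 10, interval (Y_2, Y_9), median (q = 1/2).
coef = confidence_coefficient(10, 2, 9, 0.5)
print(coef)  # equals 1002/1024
```

Scanning this function over pairs (j, k) for a fixed n shows directly how few confidence levels are attainable with small samples.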


We have presented one way, using order statistics, of obtaining point
estimates or confidence-interval estimates for a quantile.

Besides being extremely general in that the method requires few
assumptions about the form of the distribution function, the method is extraordinarily
simple. No complex analysis or distribution theory was needed; the simple
binomial distribution provided the necessary equipment to determine the
confidence coefficient. The only inconvenience was the paucity of confidence levels
that could be attained.


3.2 Tests of Hypotheses Concerning Quantiles 

Let $X_1, \ldots, X_n$ denote a random sample from a probability density function,
say $f(\cdot)$. Suppose that it is desired to test that the qth quantile of the population
sampled from is a specified value, say $\xi$. That is, it is desired to test

    $\mathscr{H}_0: \xi_q = \xi$ versus $\mathscr{H}_1: \xi_q \ne \xi$,

where $f(\cdot)$ is unspecified (other than being a probability density function).
The confidence-interval method of deriving a test (see Subsec. 3.4 of Chap. IX)
can be used; for instance, obtain a $100\gamma$ percent confidence interval for $\xi_q$, and
accept $\mathscr{H}_0$ if and only if the derived confidence interval contains $\xi$. Such a test
has size $1 - \gamma$.

An alternative test is the so-called one-sample sign test. It is a very simple
test based on the value of a statistic that represents the number of the n
transformed observations that have a positive sign. To illustrate the principle
involved in the sign test, consider testing $\mathscr{H}_0: \xi_q = \xi$ versus $\mathscr{H}_1: \xi_q \ne \xi$ for a
random sample $X_1, \ldots, X_n$ from some unspecified probability density function
$f(\cdot)$. Let Z denote the number of $X_i$'s that exceed $\xi$. Equivalently, Z is the
number of $X_1 - \xi, \ldots, X_n - \xi$ that have a positive sign. If $\mathscr{H}_0$ is true, Z has a
binomial distribution with parameters n and $p = 1 - q = \int_{\xi}^{\infty} f(x)\,dx$. So if $\mathscr{H}_0$
is true, one would expect Z to be near np, and hence an intuitively appealing
test is to accept $\mathscr{H}_0$ if and only if Z is near np. Since the distribution of Z is
known, one can determine what is meant by “near” by fixing the size of the test.
For example, suppose $q = 1/2$ so that $\xi_q = \xi_{1/2}$ = median; then a possible test of
$\mathscr{H}_0: \xi_{1/2} = \xi$ versus $\mathscr{H}_1: \xi_{1/2} \ne \xi$ is to accept $\mathscr{H}_0$ if and only if $|Z - np| =
|Z - n/2| < c$, where c is a constant determined by

    $P[|Z - n/2| < c] = 1 - \alpha,$

where $\alpha$ is the desired size of the test. Now

    $P[|Z - n/2| < c] = \sum_{i:\,|i - n/2| < c} \binom{n}{i} (1/2)^n,$

so c can be determined from a binomial table. (For small sample sizes, not
many $\alpha$'s are possible, unless randomized tests are used.) The power function
of such a test can be readily obtained since the distribution of Z is still binomial
even when the null hypothesis is false; Z has the binomial distribution with
parameters n and $p = P[X > \xi]$. Such a power function could be sketched as a
function of p.

Note also that the sign test can be used to test one-sided hypotheses. For
instance, in testing $\mathscr{H}_0: \xi_q \le \xi$ versus $\mathscr{H}_1: \xi_q > \xi$, the sign test says to reject
$\mathscr{H}_0$ if and only if Z, defined as above, is large. Again the power function can
be easily obtained.
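The size and power of the two-sided sign test for the median follow directly from the binomial distribution of Z. The Python sketch below uses hypothetical function names and hypothetical data, and it takes the acceptance region to be $|Z - n/2| < c$.

```python
from math import comb


def sign_statistic(xs, xi):
    """Z = number of observations that exceed the hypothesized value xi."""
    return sum(1 for x in xs if x > xi)


def acceptance_probability(n, c, p=0.5):
    """P[|Z - n/2| < c] when Z is binomial(n, p)."""
    return sum(
        comb(n, z) * p**z * (1 - p) ** (n - z)
        for z in range(n + 1)
        if abs(z - n / 2) < c
    )


def size(n, c):
    """Size of the test: rejection probability when p = 1/2."""
    return 1.0 - acceptance_probability(n, c)


def power(n, c, p):
    """Rejection probability when P[X > xi] = p."""
    return 1.0 - acceptance_probability(n, c, p)


data = [4.1, 5.3, 6.0, 3.8, 7.2, 5.9, 6.4, 4.7, 6.8, 5.5]  # hypothetical sample
z = sign_statistic(data, xi=5.0)
print(z, size(10, 4), power(10, 4, 0.8))
```

For n = 10 and c = 4 the acceptance region is Z = 2, ..., 8, giving size 22/1024; evaluating `power` over a grid of p values gives the power function mentioned above.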


4 TOLERANCE LIMITS 

An automatic machine in a ball-bearing factory is supposed to manufacture 
bearings .25 inch in diameter. The bearings are regarded as acceptable from an 
engineering standpoint if the diameter falls between the limits .249 and .251 inch. 
Production is regularly checked each day by measuring the diameter of a random 
sample of bearings and computing statistical tolerance limits $L_1$ and $L_2$ from
the sample. If $L_1$ is above .249 and $L_2$ is below .251, the production is
accepted. How large should the sample be so that one can be assured with
90 percent probability that the statistical tolerance limits will contain at least
80 percent of the population of bearing diameters? There is a simple
nonparametric solution to problems of this kind.

In more general terms, let $f(\cdot)$ be a probability density function, and on
the basis of a sample of n values it is desirable to determine two numbers, say
$L_1$ and $L_2$, such that at least .80, say, of the area under $f(\cdot)$ is between $L_1$ and
$L_2$. On the basis of a sample we cannot be certain that .80 of the area under
$f(\cdot)$ is between $L_1$ and $L_2$, but we can specify a probability that it is so.
In other words, we want to find two functions $L_1 = l_1(X_1, \ldots, X_n)$ and
$L_2 = l_2(X_1, \ldots, X_n)$ of the random sample $X_1, \ldots, X_n$ such that the probability
that

    $\int_{L_1}^{L_2} f(x)\,dx \ge \beta$    (11)

is equal to $\gamma$, for specified $\gamma$ and $\beta$. We summarize with a definition.

Definition 1 Tolerance limits Let $X_1, \ldots, X_n$ be a random sample from
continuous c.d.f. $F(\cdot)$ having a density function $f(\cdot)$. Let $L_1 =
l_1(X_1, \ldots, X_n) < L_2 = l_2(X_1, \ldots, X_n)$ be two statistics which satisfy:

(i) The distribution of $F(L_2) - F(L_1)$ does not depend on $F(\cdot)$.

(ii) $P[F(L_2) - F(L_1) \ge \beta] = \gamma$.

Then $L_1$ and $L_2$ will be defined to be $100\beta$ percent distribution-free tolerance
limits at probability level $\gamma$. ////

Remark Note that the random quantity $F(L_2) - F(L_1)$ represents the
area under $f(\cdot)$ between $L_1$ and $L_2$. ////


For continuous random variables, order statistics $Y_j$ and $Y_k$ ($j < k$) form
tolerance limits. To obtain the coefficients $\beta$ and $\gamma$ in the definition of tolerance
limits, we need the distribution of $F(L_2) - F(L_1)$. Recall that

    $f_{Y_j, Y_k}(y_j, y_k) = \frac{n!}{(j-1)!\,(k-1-j)!\,(n-k)!}$
        $\times [F(y_j)]^{j-1}[F(y_k) - F(y_j)]^{k-1-j}[1 - F(y_k)]^{n-k} f(y_j) f(y_k).$

Make the transformation $Z = F(Y_k) - F(Y_j)$ and $Y = F(Y_j)$, find the joint
distribution of Y and Z, and then integrate out y to get the marginal distribution of Z.
The following obtains:

    $f_Z(z) = \frac{n!}{(k-1-j)!\,(n-k+j)!}\, z^{k-1-j}(1 - z)^{n-k+j} I_{(0,1)}(z),$    (12)

which is a beta distribution with parameters $k - j$ and $n - k + j + 1$. Now

    $P[Z < \beta] = \int_0^{\beta} f_Z(z)\,dz = IB_\beta(k - j, n - k + j + 1),$

the incomplete beta function, which is tabled. Also, recall that

    $IB_\beta(k - j, n - k + j + 1) = \sum_{i=k-j}^{n} \binom{n}{i} \beta^i (1 - \beta)^{n-i}.$

Thus for any $\beta$, the probability level $\gamma$ can be computed.

EXAMPLE 3 For a random sample of size 5, use $(Y_1, Y_5)$ as a tolerance
interval for 75 percent of the population; that is, $\beta = .75$. What is the
corresponding probability level $\gamma$? We seek

    $\gamma = P[F(Y_5) - F(Y_1) \ge .75]$
        $= 1 - P[F(Y_5) - F(Y_1) < .75]$
        $= 1 - \sum_{i=4}^{5} \binom{5}{i} (.75)^i (.25)^{5-i} = .3672.$    ////

We might note that, in general,

    $\mathscr{E}[Z] = \mathscr{E}[F(Y_k)] - \mathscr{E}[F(Y_j)] = \frac{k}{n+1} - \frac{j}{n+1} = \frac{k-j}{n+1}.$    (13)

For Example 3, $\mathscr{E}[Z] = 4/6 = 2/3$.
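The probability level of a pair of order statistics used as tolerance limits can be computed from the binomial form of the incomplete beta function. In this Python sketch the function name `tolerance_level` is hypothetical; it evaluates $\gamma = 1 - IB_\beta(k - j, n - k + j + 1)$ and reproduces Example 3.

```python
from math import comb


def tolerance_level(n, j, k, beta):
    """gamma = P[F(Y_k) - F(Y_j) >= beta] = 1 - IB_beta(k - j, n - k + j + 1)."""
    a = k - j
    incomplete_beta = sum(
        comb(n, i) * beta**i * (1 - beta) ** (n - i) for i in range(a, n + 1)
    )
    return 1.0 - incomplete_beta


# Example 3: n = 5, limits (Y_1, Y_5), beta = .75.
gamma = tolerance_level(5, 1, 5, 0.75)
print(round(gamma, 4))  # 0.3672
```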


EXAMPLE 4 Suppose that it is desired to determine how large a sample must
be taken so that the probability is .90 that at least 99 percent of a future
day's output of bearings will have diameters between the largest and
smallest observations in the sample. The quantities are $\gamma = .90$ and
$\beta = .99$, and we want to determine n such that

    $P[F(Y_n) - F(Y_1) \ge \beta] = \gamma,$

where the density of $Z = F(Y_n) - F(Y_1)$ is given by Eq. (12) for $j = 1$ and
$k = n$. We get

    $\gamma = P[Z \ge \beta] = \int_{\beta}^{1} n(n - 1) z^{n-2}(1 - z)\,dz = 1 - n\beta^{n-1} + (n - 1)\beta^{n}.$

If we substitute for $\gamma$ and $\beta$, we get the equation

    $.90 = 1 - n(.99)^{n-1} + (n - 1)(.99)^{n},$

which can be solved to determine n. The solution is $n \approx 388$. ////
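The equation in Example 4 has no closed-form solution for n, but a direct search is straightforward. The Python sketch below (hypothetical function names) steps n upward until the coverage probability of $(Y_1, Y_n)$ first reaches $\gamma$.

```python
def coverage_probability(n, beta):
    """gamma(n) = 1 - n*beta^(n-1) + (n-1)*beta^n for the interval (Y_1, Y_n)."""
    return 1.0 - n * beta ** (n - 1) + (n - 1) * beta**n


def smallest_sample_size(beta, gamma):
    """Smallest n >= 2 with coverage_probability(n, beta) >= gamma."""
    n = 2
    while coverage_probability(n, beta) < gamma:
        n += 1
    return n


n_required = smallest_sample_size(0.99, 0.90)
print(n_required)  # 388
```

The search relies on the coverage probability being increasing in n, which holds because a larger sample can only widen the range between the smallest and largest observations.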

There are similarities and differences between tolerance limits and
confidence limits. Tolerance limits, like confidence limits, are two statistics, one
less than the other, that together form an interval with random end points. The
user of either interval is reasonably confident (the degree of confidence being
measured by the corresponding confidence level) that the interval obtained
contains what it is claimed to. This is where the similarity ends. A confidence
interval is an interval thought to contain a fixed unknown parameter value. On
the other hand, a tolerance interval is an interval thought to contain a prescribed
proportion of the values of the random variable under consideration. In other 
words, a confidence interval is an interval thought to contain an unknown fixed 
parameter value that characterizes the distribution of population values, whereas 
a tolerance interval is an interval thought to contain actual population values, 
and not some characteristic of them. 


5 EQUALITY OF TWO DISTRIBUTIONS 
5.1 Introduction 

In this section various tests of the equality of two populations will be studied. 
As we mentioned in Sec. 1 above, we first studied the equality of two populations 
when we tested that the means from two normal populations were equal in 
Subsec. 4.3 of Chap. IX. Then again in Subsec. 5.3 of Chap. IX, we gave a test 
of homogeneity of two populations. A great many nonparametric methods 
have been developed for testing whether two populations have the same distribution.
We shall consider only four of them; a fifth will be briefly mentioned at
the end of this subsection. 

The problem that we propose to consider is the following: Let $X_1, \ldots, X_m$
denote a random sample of size m from c.d.f. $F_X(\cdot)$ with a corresponding density
function $f_X(\cdot)$, and let $Y_1, \ldots, Y_n$ denote a random sample of size n from
c.d.f. $F_Y(\cdot)$ with a corresponding density function $f_Y(\cdot)$. (Note that we are
departing from our usual convention of using Y's to represent the order statistics
corresponding to the X's.) Further, assume that the observations from $F_X(\cdot)$
are independent of the observations from $F_Y(\cdot)$. Test $\mathscr{H}_0: F_X(z) = F_Y(z)$ for
all z versus $\mathscr{H}_1: F_X(z) \ne F_Y(z)$ for at least one value of z. In Sec. 2 above we
pointed out that the sample c.d.f. can be used to estimate the population c.d.f.
In the case that $\mathscr{H}_0$ is true, that is, $F_X(z) = F_Y(z)$, we have two independent
estimators of the common population c.d.f., one using the sample c.d.f. of the
X's and the other using the sample c.d.f. of the Y's. Intuitively, then, one might
consider using the closeness of the two sample c.d.f.'s to each other as a test
criterion. Although we will not study it, a test, called the two-sample
Kolmogorov-Smirnov test, has been devised that uses such a criterion.

We will assume throughout that the random variables under consideration
are continuous and merely point out at this time that the methods to be presented
can be extended to include discrete random variables as well. In our
presentation, we will consider testing two-sided hypotheses and will not consider
one-sided hypotheses, although the theory works equally well for one-sided
hypotheses.
5.2 Two-sample Sign Test

The first test that we will consider is the two-sample sign test. We shall see
that, in a certain sense, this test is a nonparametric analog of the paired t test
(see Prob. 17 in Chap. IX). For this test we assume that the sampling situation
is such that the X and Y observations are paired; that is, we observe $(X_1, Y_1),
\ldots, (X_n, Y_n)$. One could think that the X observation is “untreated” (a
control) and the corresponding Y observation is “treated,” the object of the
test being to determine if there is a “treatment” effect. We wish to test
$\mathscr{H}_0: F_X(\cdot) = F_Y(\cdot)$. Assume that $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a random sample
from some joint distribution $F_{X,Y}(\cdot, \cdot)$. Further assume that $F_{X,Y}(\cdot, \cdot)$ is
such that $P[X > Y] = P[X < Y] = 1/2$ when $\mathscr{H}_0$ is true. (Recall that we are
assuming continuous random variables, and then such an assumption is satisfied
if X and Y are independent.) Consider a test based on the signs of the
differences $X_i - Y_i$, $i = 1, \ldots, n$. For instance, define

    $Z_i = I_{(0,\infty)}(X_i - Y_i);$

then $Z_i$ has a Bernoulli distribution, and consequently $S_n = \sum_{i=1}^{n} Z_i$ has a binomial
distribution with parameters n and $p = P[X_i > Y_i]$. If $\mathscr{H}_0$ is true, $p = 1/2$, and
$\mathscr{E}[S_n] = n/2$. If the alternative hypothesis is two-sided so that $p = P[X_i > Y_i]$
can be either larger or smaller than 1/2, then a possible test criterion is to accept
$\mathscr{H}_0$ if $S_n$ is close to $n/2$, that is, accept $\mathscr{H}_0$ if $|S_n - n/2| < k$, where k is
determined by fixing the size of the test. k is easily determined from a binomial
table, and we have a very simple test of the equality of the two populations.

One can see that avoidance of the assumption that $X_i$ and $Y_i$ are
independent is desirable. For example, $X_i$ might represent an observation on the
ith entity before some “treatment” and $Y_i$ the observation on the same entity
after “treatment.” In such a case one is not likely to have independence of
$X_i$ and $Y_i$ since they are observations taken on the same entity, yet one can
sometimes test that there is no “treatment” effect by testing that the “before”
and “after” populations are the same.
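A sketch of the two-sample sign test in Python follows. The function name `paired_sign_test` and the data are hypothetical, ties are assumed absent, and k is chosen here as the smallest half-integer whose rejection probability does not exceed $\alpha$, which is one conservative way of "fixing the size" when the exact level is not attainable.

```python
from math import comb


def paired_sign_test(pairs, alpha=0.05):
    """Two-sided sign test on pairs (x_i, y_i); returns (S_n, k, size, accept)."""
    n = len(pairs)
    s_n = sum(1 for x, y in pairs if x > y)          # S_n = sum of the Z_i
    pmf = [comb(n, s) / 2**n for s in range(n + 1)]  # binomial(n, 1/2) under H0

    def rejection_prob(k):
        # P[|S_n - n/2| >= k] under H0, the complement of the acceptance region.
        return sum(p for s, p in enumerate(pmf) if abs(s - n / 2) >= k)

    k = 0.5
    while rejection_prob(k) > alpha:
        k += 0.5
    accept = abs(s_n - n / 2) < k
    return s_n, k, rejection_prob(k), accept


# Hypothetical before/after measurements on 10 entities.
before = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3, 12.0, 11.7, 12.9]
after = [11.6, 11.0, 12.2, 12.9, 11.1, 12.0, 12.8, 11.5, 11.2, 12.1]
print(paired_sign_test(list(zip(before, after))))
```

In this made-up data set nine of the ten differences are positive, so the statistic falls outside the acceptance region for a test of size at most .05.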


5.3 Run Test 

As before let $X_1, \ldots, X_m$ denote a random sample from $F_X(\cdot)$ and $Y_1, \ldots, Y_n$
a random sample from $F_Y(\cdot)$. A rather simple test of $\mathscr{H}_0: F_X(z) = F_Y(z)$ for
all z is based on runs of values of X and values of Y. To understand the
meaning of runs, combine the m x observations with the n y observations and
then order (in ascending order of magnitude) the combined sample. For
example, if $m = 4$ and $n = 5$, one might obtain

    y x x y x y y y x.    (14)

A run is a sequence of letters of the same kind bounded by letters of another
kind except for the first and last position. Thus, in Eq. (14) the ordering
starts with a run of one y value, then follows a run of two x values, then a run
of one y value, and so on; six runs are exhibited in Eq. (14). It is apparent that
if the two samples are from the same population, the x's and y's will ordinarily
be well mixed, and the total number of runs will be large. If the two populations
are widely separated so that their range of values does not overlap, then
the number of runs will be only two, and, in general, differences between the
two populations will tend to reduce the number of runs. Thus the two populations
may have the same mean or median, but if the x population is concentrated
while the y population is dispersed, there will be a tendency to have a long y
run on each end of the combined sample, and there will thus be a tendency
to reduce the number of runs. A test then is performed by observing the total
number of runs, say Z, in the combined sample and rejecting $\mathscr{H}_0$ if Z is less
than or equal to some specified number $z_0$. Our task now is to determine the
distribution of Z under $\mathscr{H}_0$ in order that for a given test size we may specify $z_0$.

If $\mathscr{H}_0$ is true, it can be argued that the possible arrangements of the
m x values and n y values are equally likely. It is clear that there are exactly
$\binom{m+n}{m}$ such arrangements. To find $P[Z = z]$, it is necessary now to count
all arrangements with exactly z runs. Suppose z is even, say 2k; then there
must be k runs of x values and k runs of y values. To get k runs of x values,
the m x's must be divided into k groups. We can form these k groups, or runs,
by inserting $k - 1$ dividers into the $m - 1$ spaces between the m x values with no
more than one divider per space. We can place the $k - 1$ dividers into the
$m - 1$ spaces in $\binom{m-1}{k-1}$ ways. Similarly, we can construct the k runs of
y values in $\binom{n-1}{k-1}$ ways. Any particular arrangement of the k runs of x values
can be combined with any arrangement of the k runs of y values; furthermore,
the first run in the combined arrangement can be either a run of x values or a
run of y values; hence there are a total of $2\binom{m-1}{k-1}\binom{n-1}{k-1}$ arrangements having
exactly $z = 2k$ runs. Hence

    $P[Z = z] = P[Z = 2k] = \frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}.$    (15)

Similarly, for z odd

    $P[Z = z] = P[Z = 2k + 1] = \frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}}{\binom{m+n}{m}}.$    (16)

To test $\mathscr{H}_0$ with size of Type I error equal to $\alpha$, one finds the integer $z_0$ so that
(as nearly as possible)

    $\sum_{z=2}^{z_0} P[Z = z] = \alpha$    (17)

and rejects $\mathscr{H}_0$ if the observed value of Z does not exceed $z_0$.

The computation involved in Eq. (17) can become quite tedious unless
both m and n are small. Fortunately, the distribution of Z is approximately
normal for large samples, and in fact the approximation is usually good enough
for practical purposes when both m and n exceed 10. If $\mathscr{H}_0$ is true, the mean
and variance of Z are

    $\mathscr{E}[Z] = \frac{2mn}{m + n} + 1$    (18)

and

    $\text{var}[Z] = \frac{2mn(2mn - m - n)}{(m + n)^2 (m + n - 1)}.$    (19)

The asymptotic normal distribution of Z under $\mathscr{H}_0$ has mean and variance given
in Eqs. (18) and (19). This asymptotic normal distribution can be used to
determine the critical value $z_0$ for large samples.

The run test is sensitive to both differences in shape and differences in 
location between two distributions. 
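The exact distribution of the number of runs in Eqs. (15) and (16) can be tabulated directly for small m and n. The Python sketch below uses hypothetical function names; `critical_value` takes the conservative reading of Eq. (17), choosing the largest $z_0$ whose cumulative probability does not exceed $\alpha$.

```python
from math import comb


def count_runs(labels):
    """Total number of runs in a sequence such as 'yxxyxyyyx'."""
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def run_pmf(m, n):
    """Exact P[Z = z] under H0, from Eqs. (15) and (16)."""
    total = comb(m + n, m)
    pmf = {}
    for z in range(2, m + n + 1):
        k = z // 2
        if z % 2 == 0:   # z = 2k
            ways = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
        else:            # z = 2k + 1
            ways = (comb(m - 1, k) * comb(n - 1, k - 1)
                    + comb(m - 1, k - 1) * comb(n - 1, k))
        pmf[z] = ways / total
    return pmf


def critical_value(m, n, alpha):
    """Largest z0 with P[Z <= z0] <= alpha, or None if no such z0 exists."""
    cumulative, z0 = 0.0, None
    for z, p in sorted(run_pmf(m, n).items()):
        cumulative += p
        if cumulative > alpha:
            break
        z0 = z
    return z0


pmf = run_pmf(4, 5)
mean = sum(z * p for z, p in pmf.items())
variance = sum((z - mean) ** 2 * p for z, p in pmf.items())
print(count_runs("yxxyxyyyx"), mean, variance, critical_value(4, 5, 0.10))
```

For m = 4 and n = 5 the mean and variance computed from the exact distribution agree with Eqs. (18) and (19), which gives a useful check on the counting argument.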


5.4 Median Test 

Let $X_1, \ldots, X_m$ be a random sample from $F_X(\cdot)$ and $Y_1, \ldots, Y_n$ be a random
sample from $F_Y(\cdot)$. As in the previous subsection, combine the two samples,
and order them. Let $Z_1 < Z_2 < \cdots < Z_{m+n}$ be the combined ordered sample.
The median test of $\mathscr{H}_0: F_X(u) = F_Y(u)$ for all u consists of finding the median,
say $\tilde{z}$, of the z values and then counting the number of x values, say $m_1$, which
exceed $\tilde{z}$ and the number of y values, say $n_1$, which exceed $\tilde{z}$. If $\mathscr{H}_0$ is true, $m_1$
should be approximately $m/2$ and $n_1$ approximately $n/2$. We can use either the
statistic $M_1$ or the statistic $N_1$ to construct the test. Let us use $M_1$ = number
of X's which exceed $\tilde{Z}$, the median of the combined sample. If $m + n$ is even,
there are exactly $(m + n)/2$ of the observations (combined x's and y's) greater
than the median of the combined sample. (Since we have an even number of
continuous random variables, no two are equal, and the median is midway
between the middle two.) It can be easily argued that

    $P[M_1 = m_1] = \frac{\binom{m}{m_1}\binom{n}{(m+n)/2 - m_1}}{\binom{m+n}{(m+n)/2}}$

for $m + n$ even and $\mathscr{H}_0$ true. A similar expression obtains for $m + n$ odd.
Such a distribution can be used to find a constant k such that

    $P[|M_1 - m/2| > k] \approx \alpha,$

and our test is given by the following:

    Reject $\mathscr{H}_0$ if and only if $|M_1 - m/2| > k$.

Just as in the run test, an asymptotic normal distribution of $M_1$ can be
derived, but we will not study it.
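The null distribution of $M_1$ for $m + n$ even is hypergeometric and can be tabulated as follows. This Python sketch uses hypothetical function names and data, and it covers only the even case treated in the text.

```python
from math import comb


def median_test_pmf(m, n):
    """P[M_1 = m1] under H0 for m + n even (hypergeometric form)."""
    if (m + n) % 2:
        raise ValueError("this sketch covers only m + n even")
    half = (m + n) // 2
    total = comb(m + n, half)
    return {
        m1: comb(m, m1) * comb(n, half - m1) / total
        for m1 in range(max(0, half - n), min(m, half) + 1)
    }


def count_above_median(xs, ys):
    """M_1: number of x values exceeding the median of the combined sample."""
    combined = sorted(list(xs) + list(ys))
    half = len(combined) // 2
    median = (combined[half - 1] + combined[half]) / 2  # midway between middle two
    return sum(1 for x in xs if x > median)


pmf = median_test_pmf(4, 6)
m1 = count_above_median([1, 2, 9, 10], [3, 4, 5, 6, 7, 8])  # hypothetical data
print(pmf, m1)
```

Summing the tail probabilities of this distribution over $|m_1 - m/2| > k$ gives the size attained by each candidate k.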


5.5 Rank-sum Test 

A very interesting nonparametric test for two samples was described by
Wilcoxon and studied by Mann and Whitney. Given two random samples
$X_1, X_2, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from populations with absolutely continuous
c.d.f.'s $F_X(\cdot)$ and $F_Y(\cdot)$, respectively, one arranges the $m + n$ observations in
ascending order and then replaces the smallest observation by 1, the next by 2,
and so on, the largest being replaced by $m + n$. These integers are called the
ranks of the observations. Let $T_x$ denote the sum of the ranks of the m x values
and $T_y$ the sum of the ranks of the n y values. Note that

    $T_x + T_y = \sum_{j=1}^{m+n} j = (m + n + 1)(m + n)/2;$

so $T_y$ is a linear function of $T_x$. We could base a test
on either statistic $T_x$ or $T_y$. Let us use $T_x$. $T_x$ is linearly related to another
statistic, which we denote by U. Set

    $U = \sum_{j=1}^{n} \sum_{i=1}^{m} I_{[Y_j, \infty)}(X_i),$    (20)

the number of times an X exceeds a Y. For a given set of observations, let
$r_1, r_2, \ldots, r_m$ denote the ranks of the x values, and let $x'_1, \ldots, x'_m$ denote the
ordered x values. Clearly $x'_1$ exceeds $(r_1 - 1)$ y-values, $x'_2$ exceeds $(r_2 - 2)$
y-values, and so on, and $x'_m$ exceeds $(r_m - m)$ y-values. Hence

    $u = \sum_{i=1}^{m} (r_i - i) = \sum_{i=1}^{m} r_i - \sum_{i=1}^{m} i = t_x - \frac{m(m + 1)}{2},$

or

    $U = T_x - \frac{m(m + 1)}{2}.$    (21)
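The linear relation of Eq. (21) can be checked numerically by computing U from its definition and $T_x$ from the ranks. In this Python sketch the function name and the data are hypothetical, and no tied values are assumed.

```python
def rank_sum_and_u(xs, ys):
    """Return (T_x, U) for samples with no tied values."""
    combined = sorted(list(xs) + list(ys))
    rank = {v: i for i, v in enumerate(combined, start=1)}
    t_x = sum(rank[x] for x in xs)                   # sum of the ranks of the x's
    u = sum(1 for x in xs for y in ys if x > y)      # times an X exceeds a Y
    return t_x, u


xs = [1.3, 2.7, 3.1]  # hypothetical x sample (m = 3)
ys = [0.8, 2.2]       # hypothetical y sample (n = 2)
t_x, u = rank_sum_and_u(xs, ys)
m = len(xs)
print(t_x, u, t_x - m * (m + 1) // 2)  # 11 5 5
```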


To find the first two moments of $T_x$, we find the first two moments of U.

    $\mathscr{E}[U] = \mathscr{E}\left[\sum\sum I_{[Y_j, \infty)}(X_i)\right] = \sum\sum \mathscr{E}[I_{[Y_j, \infty)}(X_i)]$
        $= \sum\sum P[X_i \ge Y_j] = \sum\sum p = mnp,$

where

    $p = P[X_i > Y_j] = \int P[Y < x \mid X = x] f_X(x)\,dx = \int F_Y(x) f_X(x)\,dx.$

If $\mathscr{H}_0$ is true,

    $p = \int F_X(x) f_X(x)\,dx = \int_0^1 u\,du = 1/2.$

Similarly, the variance of U can be found. The derivation is somewhat more
complicated since one needs the expected value of $U^2$. From the mean and
variance of U, the mean and variance of $T_x$ can be obtained. If $\mathscr{H}_0$ is true,
they are given by

    $\mathscr{E}[T_x] = \frac{m(m + n + 1)}{2}$    (22)

and

    $\text{var}[T_x] = \frac{mn(m + n + 1)}{12}.$    (23)

The exact distribution of $T_x$ turns out to be a very troublesome problem
for large m and n. However, Mann and Whitney have calculated the distribution
for small m and n, have shown that $T_x$ is approximately normally distributed
for large m and n, and have demonstrated that the normal approximation is
quite accurate when m and n are larger than 7. Thus for samples of reasonable
size one can use the normal approximation with mean and variance given by
Eqs. (22) and (23) to find a critical region for testing $\mathscr{H}_0: F_X(z) = F_Y(z)$ for all z
versus $\mathscr{H}_1: F_X(z) \ne F_Y(z)$. The test would be the following:

    Reject $\mathscr{H}_0$ if $|T_x - \mathscr{E}[T_x]|$ is large;

that is,

    Reject $\mathscr{H}_0$ if and only if $|T_x - \mathscr{E}[T_x]| > k$,

where k is determined by fixing the size of the test and using the asymptotic
normal distribution of $T_x$.

EXAMPLE 5 Find the exact distribution of $T_x$ under $\mathscr{H}_0$ for $m = 3$ and
$n = 2$. Each of the following arrangements is equally likely if $\mathscr{H}_0$ is
true:

    xxxyy, xxyxy, xxyyx, xyxxy, xyxyx,
    xyyxx, yxxxy, yxxyx, yxyxx, yyxxx.

The corresponding $T_x$ values are, respectively, 6, 7, 8, 8, 9, 10, 9, 10, 11, 12;
so

    $P[T_x = 6] = P[T_x = 7] = 1/10,$

    $P[T_x = 8] = P[T_x = 9] = P[T_x = 10] = 2/10,$

and

    $P[T_x = 11] = P[T_x = 12] = 1/10.$    ////
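The enumeration in Example 5 generalizes to any small m and n: under the null hypothesis every choice of m ranks out of $m + n$ for the x values is equally likely. The Python sketch below (hypothetical function name) builds the exact distribution this way, using exact fractions, and compares its mean and variance with Eqs. (22) and (23).

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def rank_sum_distribution(m, n):
    """Exact P[T_x = t] under H0 by enumerating which ranks the x's receive."""
    counts = Counter(
        sum(ranks) for ranks in combinations(range(1, m + n + 1), m)
    )
    total = sum(counts.values())
    return {t: Fraction(c, total) for t, c in sorted(counts.items())}


dist = rank_sum_distribution(3, 2)
mean = sum(t * p for t, p in dist.items())
var = sum((t - mean) ** 2 * p for t, p in dist.items())
print(dist)
print(mean, var)  # 9 and 3, matching m(m+n+1)/2 and mn(m+n+1)/12
```

The number of subsets grows as $\binom{m+n}{m}$, so this brute-force approach is practical only for small samples, which is where the normal approximation is least reliable.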

PROBLEMS

1 Show that $T = \frac{1}{n}\sum_{i=1}^{n} I_B(X_i)$ is an unbiased estimator of $P[X \in B]$. Find var [T],
and show that T is a mean-squared-error consistent estimator of $P[X \in B]$.

2 Define $F_n(B_j) = \frac{1}{n}\sum_{i=1}^{n} I_{B_j}(X_i)$ for $j = 1, 2$. Find cov $[F_n(B_1), F_n(B_2)]$.

3 Let $Y_1, \ldots, Y_n$ be the order statistics corresponding to a random sample of size n
from a continuous c.d.f. $F(\cdot)$.

(a) Find the density of $F(Y_j)$.

(b) Find the joint density of $F(Y_i)$ and $F(Y_j)$.

(c) Find the density of $[F(Y_n) - F(Y_2)]/[F(Y_n) - F(Y_1)]$.

4 Let $X_1, \ldots, X_n$ be independent and identically distributed random variables
having common continuous c.d.f. $F(\cdot)$. Let $Y_1 < \cdots < Y_n$ be the corresponding
order statistics, and define $F_n(\cdot)$ to be the sample c.d.f. Set

    $D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|.$

(a) Find the exact distribution of $D_n$ for $n = 1$.

(b) Do the same for $n = 2$. Hint: Does $D_n = \max[F(Y_1), 1/2 - F(Y_1), F(Y_2) - 1/2,
1 - F(Y_2)]$?

(c) Argue that the exact distribution of $D_n$ will not depend on $F(\cdot)$.

5 Show that the expected value of the larger of a random sample of two observations
from a normal population with mean 0 and unit variance is $1/\sqrt{\pi}$ and hence that
for the general normal population the expected value is $\mu + \sigma/\sqrt{\pi}$.

6 If (X, Y) is an observation from a bivariate normal population with means 0, unit
variances, and correlation $\rho$, show that the expected value of the larger of X and Y
is $\sqrt{(1 - \rho)/\pi}$.

7 We have seen that the sample mean for a distribution with infinite variance (such
as the Cauchy distribution) is not necessarily a consistent estimator of the population
mean. Is the sample median a consistent estimator of the population median?

8 Construct an (approximate) 90 percent confidence band for the data of Example 1.
Does your band include the appropriate uniform distribution?

9 Let $Y_1 < \cdots < Y_5$ be the order statistics corresponding to a random sample from
some continuous c.d.f. Compute $P[Y_1 < \xi_{.50} < Y_5]$ and $P[Y_2 < \xi_{.50} < Y_4]$.
Compute $P[Y_1 < \xi_{.20} < Y_2]$. Compute $P[Y_3 < \xi_{.75} < Y_5]$.

10 Let $Y_1$ and $Y_n$ be the first and last order statistics of a random sample of size n
from some continuous c.d.f. $F(\cdot)$. Find the smallest value of n such that
$P[F(Y_n) - F(Y_1) \ge .75] \ge .90$.

11 Test as many ways as you know how at the 5 percent level that the following two
samples came from the same population:

    x    1.3   1.4   1.4   1.5   1.7   1.9   1.9
    y    1.6   1.8   2.0   2.1   2.1   2.2   2.3

12 Let $X_1, \ldots, X_5$ denote a random sample of size 5 from the density $f(x; \theta) =
I_{(\theta - 1/2,\, \theta + 1/2)}(x)$. Consider estimating $\theta$.

(a) Determine the confidence coefficient of the confidence interval $(Y_1, Y_5)$.

(b) Find a confidence interval for $\theta$ that has the same confidence coefficient as in
part (a) using the pivotal quantity $(Y_1 + Y_5)/2 - \theta$.

(c) Compare the expected lengths of the confidence intervals of parts (a) and (b).

13 Find var [U] when $F_X(\cdot) = F_Y(\cdot)$. See Eq. (20).

14 Equation (21) shows that U and $T_x$ are linearly related. Find the exact distribution
of U or $T_x$ when $\mathscr{H}_0$ is true for small sample sizes. For example, take $m = 1$,
$n = 2$; $m = 1$, $n = 3$; $m = 2$, $n = 1$; $m = 3$, $n = 1$; and $m = n = 2$.

15 We saw that $\mathscr{E}[U] = mnp$. Is $U/mn$ an unbiased estimator of $p = P[X_i > Y_j]$
whether or not $\mathscr{H}_0$ is true? Is $U/mn$ a consistent estimator of p?

16 A common measure of association for random variables X and Y is the rank
correlation, or Spearman's correlation. The X values are ranked, and the observations
are replaced by their ranks; similarly the Y observations are replaced by their
ranks. For example, for a sample of size 5 the observations

    x    20.4   19.7   21.8   20.1          20.7
    y     9.2    8.9   11.4   [illegible]   10.3

are replaced by

    r(x)    3   1   5   2   4
    r(y)    2   1   5   3   4

Let $r(X_i)$ denote the rank of $X_i$ and $r(Y_i)$ the rank of $Y_i$. Using these paired
ranks, the ordinary sample correlation is computed:

    Spearman's correlation $= S = \frac{\sum [r(X_i) - \bar{r}(X)][r(Y_i) - \bar{r}(Y)]}{\sqrt{\sum [r(X_i) - \bar{r}(X)]^2 \sum [r(Y_i) - \bar{r}(Y)]^2}},$

where $\bar{r}(X) = \sum r(X_i)/n$ and $\bar{r}(Y) = \sum r(Y_i)/n$.

(a) Show that $S = 1 - 6\sum D_i^2/(n^3 - n)$, where $D_i = r(X_i) - r(Y_i)$.

(b) Compute the ordinary correlation and Spearman's correlation for the above
data.

17 Argue that the distribution of S in Prob. 16 is independent of the form of the
distributions of X and Y provided that X and Y are continuous and independently
distributed random variables. Hence S can be used as a test statistic in a
nonparametric test of the null hypothesis of independence.

18 Show that the mean and variance of S (in Prob. 17) under the hypothesis of
independence are 0 and $1/(n - 1)$, respectively.



APPENDIX A 
MATHEMATICAL ADDENDUM 


1 INTRODUCTION 

The purpose of this appendix is to provide the reader with a ready reference to some 
mathematical results that are used in the book. This appendix is divided into two 
main sections: The first, Sec. 2 below, gives results that are, for the most part, com¬ 
binatorial in nature, and the last gives results from calculus. No attempt is made to 
prove these results, although sometimes a method of proof is indicated. 


2 NONCALCULUS 


2.1 Summation and Product Notation 

A sum of terms such as $n_3 + n_4 + n_5 + n_6 + n_7$ is often designated by the symbol
$\sum_{i=3}^{7} n_i$. $\Sigma$ is the capital Greek letter sigma, and in this connection it is often called the
summation sign. The letter i is called the summation index. The term following $\Sigma$ is
called the summand. The “i = 3” below $\Sigma$ indicates that the first term of the sum is
obtained by putting $i = 3$ in the summand. The “7” above the $\Sigma$ indicates that the
final term of the sum is obtained by putting $i = 7$ in the summand. The other terms
of the sum are obtained by giving i the integral values between the limits 3 and 7. Thus

    $\sum_{j=2}^{5} (-1)^{j-2}\, j\, x^{2j} = 2x^4 - 3x^6 + 4x^8 - 5x^{10}.$

An analogous notation for a product is obtained by substituting the capital
Greek letter $\Pi$ for $\Sigma$. In this case the terms resulting from substituting the integers
for the index are multiplied instead of added. Thus




EXAMPLE 1 Some useful formulas involving summations are listed below.
They can be proved using mathematical induction.

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \qquad (1)$$

$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6} \qquad (2)$$

$$\sum_{i=1}^{n} i^3 = \left[\frac{n(n+1)}{2}\right]^2 \qquad (3)$$

$$\sum_{i=1}^{n} i^4 = \frac{n(n+1)(2n+1)(3n^2+3n-1)}{30} \qquad (4)$$

Equation (1) can be used to derive the following formula for an arithmetic series
or progression:

$$\sum_{j=1}^{n} [a + (j-1)d] = na + \frac{d}{2}\, n(n-1). \qquad (5)$$

A companion series, the finite geometric series, or progression, is given by

$$\sum_{j=0}^{n-1} a r^j = a\,\frac{1 - r^n}{1 - r}. \qquad (6)$$

////
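The closed forms of Example 1 are easy to spot-check by brute force. The short Python sketch below compares each formula with a direct sum; the function names are illustrative, and exact rational arithmetic (`fractions.Fraction`) is used so that the comparisons are not blurred by rounding.

```python
from fractions import Fraction

def power_sum(n, p):
    """Direct evaluation of 1^p + 2^p + ... + n^p."""
    return sum(i ** p for i in range(1, n + 1))

def closed_forms(n):
    """Right-hand sides of Eqs. (1) to (4) for a positive integer n."""
    return (
        n * (n + 1) // 2,
        n * (n + 1) * (2 * n + 1) // 6,
        (n * (n + 1) // 2) ** 2,
        n * (n + 1) * (2 * n + 1) * (3 * n * n + 3 * n - 1) // 30,
    )

def arithmetic_sum(a, d, n):
    """Left-hand side of Eq. (5)."""
    return sum(a + (j - 1) * d for j in range(1, n + 1))

def geometric_sum(a, r, n):
    """Left-hand side of Eq. (6); requires r != 1 for the closed form."""
    return sum(a * r ** j for j in range(n))

for n in range(1, 30):
    assert closed_forms(n) == tuple(power_sum(n, p) for p in (1, 2, 3, 4))

a, d, r, n = Fraction(2), Fraction(3), Fraction(1, 2), 10
print(arithmetic_sum(a, d, n), n * a + d / 2 * n * (n - 1))
print(geometric_sum(a, r, n), a * (1 - r ** n) / (1 - r))
```

A check of this kind is not a proof, of course; induction, as the text notes, supplies that.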


2.2 Factorial and Combinatorial Symbols and Conventions 

A product of a positive integer n by all the positive integers smaller than it is usually
denoted by n! (read "n factorial"). Thus

$$n! = n(n-1)(n-2)\cdots 1 = \prod_{j=0}^{n-1} (n-j). \qquad (7)$$

0! is defined to be 1.

A product of a positive integer n by the next k - 1 smaller positive integers is
usually denoted by $(n)_k$. Thus

$$(n)_k = n(n-1)\cdots(n-k+1) = \prod_{j=1}^{k} (n-j+1). \qquad (8)$$

Note that there are k terms in the product in Eq. (8).

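Remark $(n)_k = n!/(n-k)!$, and $(n)_n = n!/0! = n!$.

The combinatorial symbol $\binom{n}{k}$ is defined as follows:

$$\binom{n}{k} = \frac{(n)_k}{k!} = \frac{n!}{(n-k)!\,k!}. \qquad (9)$$

$\binom{n}{k}$ is read "combination of n things taking k at a time" or more briefly as
"n pick k"; it is also called a binomial coefficient. Define

$$\binom{n}{k} = 0 \quad \text{if } k < 0 \text{ or } k > n. \qquad (10)$$

Remark $\binom{n}{0} = \binom{n}{n} = 1$, and

$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1} \quad \text{for } n = 1, 2, \ldots \text{ and } k = 0, \pm 1, \pm 2, \ldots. \qquad (11)$$

Equation (11) is a useful recurrent formula that is easily proved.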


Both $(n)_k$ and the combinatorial symbol $\binom{n}{k}$ can be generalized from a positive integer
n to any real number t by defining

$$(t)_k = t(t-1)\cdots(t-k+1),$$

$$\binom{t}{k} = \frac{t(t-1)\cdots(t-k+1)}{k!} \quad \text{for } k = 1, 2, \ldots, \qquad (12)$$

and $\binom{t}{k} = 1$ for k = 0.

Remark

$$\binom{-n}{k} = \frac{(-n)(-n-1)\cdots(-n-k+1)}{k!} = (-1)^k\, \frac{n(n+1)\cdots(n+k-1)}{k!} = (-1)^k \binom{n+k-1}{k}.$$

////
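The generalized symbols of this subsection translate directly into code. The sketch below is only an illustration: `falling` and `gen_binom` are names chosen here, not notation from the text. It evaluates $(t)_k$ and the generalized combinatorial symbol exactly and checks the identity for a negative upper index against the standard-library `math.comb`.

```python
from fractions import Fraction
from math import comb, factorial

def falling(t, k):
    """(t)_k = t (t - 1) ... (t - k + 1); the empty product (k = 0) is 1."""
    result = 1
    for j in range(k):
        result *= t - j
    return result

def gen_binom(t, k):
    """Generalized combinatorial symbol for real t and integer k."""
    if k < 0:
        return Fraction(0)
    return Fraction(falling(t, k)) / factorial(k)

# For a positive integer n the definition agrees with the ordinary
# binomial coefficient, including the convention that it is 0 for k > n.
assert all(gen_binom(6, k) == comb(6, k) for k in range(0, 10))

# Negative upper index: the symbol for -n equals (-1)^k times the symbol for n + k - 1.
for n in range(1, 6):
    for k in range(0, 6):
        assert gen_binom(-n, k) == (-1) ** k * comb(n + k - 1, k)

print(gen_binom(-3, 2), gen_binom(Fraction(1, 2), 2))
```

Using `Fraction` keeps the arithmetic exact for rational t; for an irrational t one would fall back to floating point.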


2.3 Stirling’s Formula 

In finding numerical values of probabilities, one is often confronted with the evaluation
of long factorial expressions which can be troublesome to compute by direct multiplication.
Much labor may be saved by using Stirling's formula, which gives an approximate
value of n!. Stirling's formula is

$$n! \approx (2\pi)^{1/2} e^{-n} n^{n+1/2} \qquad (13)$$

or

$$n! = (2\pi)^{1/2} e^{-n} n^{n+1/2} e^{r(n)/12n}, \qquad (14)$$

where $1 - 1/(12n + 1) < r(n) < 1$. To indicate the accuracy of Stirling's formula, 10!
was evaluated using five-place logarithms and Eq. (13), and 3,599,000 was obtained.
The actual value of 10! is 3,628,800. The percent error is less than 1 percent, and the
percent error will decrease as n increases.
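The accuracy claim is easy to reproduce in floating point rather than with five-place logarithms. The following Python sketch (function names are illustrative) evaluates Eq. (13), reports the relative error for several n, and recovers the quantity r(n) of Eq. (14) so that its stated bounds can be inspected.

```python
from math import exp, factorial, log, pi, sqrt

def stirling(n):
    """Right-hand side of Eq. (13): (2 pi)^(1/2) e^(-n) n^(n + 1/2)."""
    return sqrt(2 * pi) * exp(-n) * n ** (n + 0.5)

def r_of_n(n):
    """Solve Eq. (14) for r(n): n! = stirling(n) * exp(r(n) / (12 n))."""
    return 12 * n * log(factorial(n) / stirling(n))

for n in (1, 5, 10, 20, 50):
    exact = factorial(n)
    rel_err = (exact - stirling(n)) / exact
    print(n, f"{rel_err:.5f}", f"{r_of_n(n):.6f}", 1 - 1 / (12 * n + 1))
```

For n = 10 the relative error comes out a little above 0.8 percent, in line with the statement that it is less than 1 percent, and it shrinks as n grows.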


2.4 The Binomial and Multinomial Theorems 
The binomial theorem is often given as

$$(a + b)^n = \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j} \qquad (15)$$

for n, a positive integer. The binomial theorem explains why the $\binom{n}{j}$ are sometimes
called binomial coefficients. Four special cases are noted in the following remark.

Remark

$$(1 + t)^n = \sum_{j=0}^{n} \binom{n}{j} t^j, \qquad (16)$$

$$(1 - t)^n = \sum_{j=0}^{n} \binom{n}{j} (-1)^j t^j, \qquad (17)$$

$$2^n = \sum_{j=0}^{n} \binom{n}{j}, \qquad (18)$$

and

$$0 = \sum_{j=0}^{n} (-1)^j \binom{n}{j}. \qquad (19)$$

////

Expanding both sides of

$$(1 + x)^a (1 + x)^b = (1 + x)^{a+b}$$

and then equating coefficients of x to the nth power gives

$$\sum_{j=0}^{n} \binom{a}{j} \binom{b}{n-j} = \binom{a+b}{n}, \qquad (20)$$

a formula that is particularly useful in considerations of the hypergeometric distribution.
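A generalization of the binomial theorem is the multinomial theorem, which is

$$\left(\sum_{i=1}^{k} x_i\right)^n = \sum \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} x_i^{n_i}, \qquad (21)$$

where the summation is over all nonnegative integers $n_1, n_2, \ldots, n_k$ which sum to n.
A special case is

$$\left(\sum_{i=1}^{k} x_i\right)^2 = \sum_{i=1}^{k} \sum_{j=1}^{k} x_i x_j. \qquad (22)$$

Also note that

$$\left(\sum_{i=1}^{m} a_i\right)\left(\sum_{j=1}^{n} b_j\right) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j. \qquad (23)$$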

3 CALCULUS 

3.1 Preliminaries 

It is assumed that the reader is familiar with the concepts of limits, continuity,
differentiation, integration, and infinite series. A particular limit that is referred to several
times in the book is the limit expression for the number e; that is,

$$\lim_{x \to 0} (1 + x)^{1/x} = e. \qquad (24)$$

Equation (24) can be derived by taking logarithms and utilizing l'Hospital's rule,
which is reviewed below. There are a number of variations of Eq. (24), for instance,

$$\lim_{x \to \infty} (1 + x^{-1})^x = e \qquad (25)$$

and

$$\lim_{x \to 0} (1 + \lambda x)^{1/x} = e^{\lambda} \quad \text{for constant } \lambda. \qquad (26)$$

A rule that is often useful in finding limits is the following so-called l'Hospital's
rule: If $f(\cdot)$ and $g(\cdot)$ are functions for which $\lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0$ and if

$$\lim_{x \to a} \frac{f'(x)}{g'(x)}$$

exists, then so does

$$\lim_{x \to a} \frac{f(x)}{g(x)},$$

and

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.$$

EXAMPLE 2 Find $\lim_{x \to 0} [(1/x) \log_e (1 + x)]$. Let $f(x) = \log_e (1 + x)$ and $g(x) = x$;
then

$$\lim_{x \to 0} \frac{f'(x)}{g'(x)} = \lim_{x \to 0} \frac{1}{1 + x} = 1 = \lim_{x \to 0} \left[\frac{1}{x} \log_e (1 + x)\right].$$

////

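Another rule that we use in the book is Leibniz' rule for differentiating an integral:
Let

$$I(t) = \int_{g(t)}^{h(t)} f(x; t)\, dx,$$

where $f(\cdot\,; \cdot)$, $g(\cdot)$, and $h(\cdot)$ are assumed differentiable. Then

$$\frac{dI}{dt} = \int_{g(t)}^{h(t)} \frac{\partial f}{\partial t}\, dx + f(h(t); t)\, \frac{dh}{dt} - f(g(t); t)\, \frac{dg}{dt}. \qquad (27)$$

Several important special cases derive from Leibniz' rule; for example, if the
integrand f(x; t) does not depend on t, then

$$\frac{d}{dt}\left[\int_{g(t)}^{h(t)} f(x)\, dx\right] = f(h(t))\, \frac{dh}{dt} - f(g(t))\, \frac{dg}{dt}; \qquad (28)$$

in particular, if g(t) is constant and h(t) = t, Eq. (28) simplifies to

$$\frac{d}{dt}\left[\int_{g}^{t} f(x)\, dx\right] = f(t). \qquad (29)$$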
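Leibniz' rule can be checked numerically on a concrete case. The sketch below is an illustration under assumptions chosen here, not an example from the text: the integrand is $f(x; t) = \sin(tx)$ with limits $g(t) = t$ and $h(t) = t^2$, integrals are approximated with a hand-written composite Simpson rule, and the derivative of the integral is approximated by a central difference.

```python
from math import cos, sin

def simpson(func, lo, hi, panels=2000):
    """Composite Simpson's rule on [lo, hi]; panels must be even."""
    step = (hi - lo) / panels
    total = func(lo) + func(hi)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * func(lo + i * step)
    return total * step / 3

def integral_I(t):
    """I(t): integral of sin(t x) dx from g(t) = t to h(t) = t^2."""
    return simpson(lambda x: sin(t * x), t, t * t)

def leibniz_rhs(t):
    """Integral of the t-partial, plus the two boundary terms of the rule."""
    inner = simpson(lambda x: x * cos(t * x), t, t * t)  # d/dt sin(t x) = x cos(t x)
    upper = sin(t * t * t) * 2 * t                       # f(h(t); t) * dh/dt
    lower = sin(t * t) * 1                               # f(g(t); t) * dg/dt
    return inner + upper - lower

def numeric_derivative(t, eps=1e-5):
    """Central-difference approximation to dI/dt."""
    return (integral_I(t + eps) - integral_I(t - eps)) / (2 * eps)

t0 = 1.5
print(numeric_derivative(t0), leibniz_rhs(t0))
```

The two printed numbers agree to many decimal places, as the rule predicts; the agreement is limited only by the discretization of the integral and the derivative.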
3.2 Taylor Series

The Taylor series for f(x) about x = a is defined as

$$f(x) = f(a) + f^{(1)}(a)(x - a) + \frac{f^{(2)}(a)(x - a)^2}{2!} + \cdots + \frac{f^{(n)}(a)(x - a)^n}{n!} + R_n, \qquad (30)$$

where

$$f^{(j)}(a) = \left.\frac{d^j f(x)}{dx^j}\right|_{x=a}, \qquad R_n = \frac{f^{(n+1)}(c)(x - a)^{n+1}}{(n + 1)!},$$

and $a < c < x$.

$R_n$ is called the remainder. f(x) is assumed to have derivatives of at least order n + 1.
If the remainder is not too large, Eq. (30) gives a polynomial (of degree n) approximation,
when $R_n$ is dropped, of the function $f(\cdot)$. The infinite series corresponding to
Eq. (30) will converge in some interval if $\lim_{n \to \infty} R_n = 0$ in this interval. Several important
infinite Taylor series, along with their intervals of convergence, are given in the following
examples.


EXAMPLE 3 Suppose $f(x) = e^x$ and a = 0. Then

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{j=0}^{\infty} \frac{x^j}{j!} \quad \text{for } -\infty < x < \infty. \qquad (31)$$

////


EXAMPLE 4 Suppose $f(x) = (1 - x)^t$ and a = 0; then $f^{(1)}(x) = -t(1 - x)^{t-1}$,
$f^{(2)}(x) = t(t - 1)(1 - x)^{t-2}$, ..., $f^{(j)}(x) = (-1)^j t(t - 1)\cdots(t - j + 1)(1 - x)^{t-j}$,
and hence

$$f(x) = (1 - x)^t = \sum_{j=0}^{\infty} (-1)^j (t)_j \frac{x^j}{j!} = \sum_{j=0}^{\infty} \binom{t}{j} (-x)^j \quad \text{for } -1 < x < 1. \qquad (32)$$

////

There are several interesting special cases of Eq. (32). t = -n gives

$$(1 - x)^{-n} = \sum_{j=0}^{\infty} \binom{-n}{j} (-x)^j = \sum_{j=0}^{\infty} \binom{n + j - 1}{j} x^j \quad \text{for } -1 < x < 1; \qquad (33)$$

t = -1 gives the geometric series

$$(1 - x)^{-1} = \sum_{j=0}^{\infty} x^j; \qquad (34)$$

t = -2 gives

$$(1 - x)^{-2} = \sum_{j=0}^{\infty} (j + 1) x^j. \qquad (35)$$


EXAMPLE 5 Suppose $f(x) = \log_e (1 + x)$ and a = 0; then

$$\log_e (1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \quad \text{for } -1 < x < 1. \qquad (36)$$

////
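The three series of Examples 3 to 5 can be compared with library values by summing finitely many terms. The Python sketch below does so; the function names and the chosen evaluation points are illustrative, and the series for $(1 - x)^{-n}$ is taken in the form with coefficients $\binom{n+j-1}{j}$.

```python
from math import comb, exp, factorial, log

def exp_partial(x, n):
    """Taylor polynomial of degree n for e^x about 0."""
    return sum(x ** j / factorial(j) for j in range(n + 1))

def neg_binomial_partial(x, n, terms):
    """First `terms` terms of the series for (1 - x)^(-n), valid for |x| < 1."""
    return sum(comb(n + j - 1, j) * x ** j for j in range(terms))

def log1p_partial(x, terms):
    """First `terms` terms of x - x^2/2 + x^3/3 - ... for log(1 + x)."""
    return sum((-1) ** (j + 1) * x ** j / j for j in range(1, terms + 1))

print(exp_partial(1.0, 20), exp(1.0))
print(neg_binomial_partial(0.3, 3, 200), (1 - 0.3) ** -3)
print(log1p_partial(0.5, 60), log(1.5))

# The degree-n polynomial for e^x improves as n increases, as the remainder
# term R_n of Eq. (30) suggests.
for n in (2, 4, 8, 16):
    print(n, abs(exp(1.0) - exp_partial(1.0, n)))
```

With t = -1 and t = -2 the middle function reproduces the geometric series and its derivative, which is a quick way to sanity-check the coefficients.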

The Taylor series for functions of one variable given in Eq. (30) can be generalized
to the Taylor series for functions of several variables. For example, the Taylor series
for f(x, y) about x = a and y = b can be written as

$$f(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b) + \frac{1}{2!}\left[f_{xx}(a, b)(x - a)^2 + 2 f_{xy}(a, b)(x - a)(y - b) + f_{yy}(a, b)(y - b)^2\right] + \cdots,$$

where

$$f_x(a, b) = \left.\frac{\partial f}{\partial x}\right|_{x=a,\, y=b}, \qquad f_{xy}(a, b) = \left.\frac{\partial^2 f}{\partial y\, \partial x}\right|_{x=a,\, y=b},$$

and similarly for the others.

3.3 The Gamma and Beta Functions 

The gamma function, denoted by $\Gamma(\cdot)$, is defined by

$$\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx \quad \text{for } t > 0. \qquad (37)$$

$\Gamma(t)$ is nothing more than a notation for the definite integral that appears on the
right-hand side of Eq. (37). Integration by parts yields

$$\Gamma(t + 1) = t\,\Gamma(t), \qquad (38)$$

and, hence, if t = n (an integer),

$$\Gamma(n + 1) = n!. \qquad (39)$$

If n is an integer,

$$\Gamma(n + \tfrac{1}{2}) = \frac{1 \cdot 3 \cdot 5 \cdots (2n - 1)}{2^n} \sqrt{\pi}, \qquad (40)$$

and, in particular,

$$\Gamma(\tfrac{1}{2}) = 2\,\Gamma(\tfrac{3}{2}) = \sqrt{\pi}. \qquad (41)$$

The beta function, denoted by $B(\cdot, \cdot)$, is defined by

$$B(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\, dx \quad \text{for } a > 0,\ b > 0. \qquad (42)$$

Again, B(a, b) is just a notation for the definite integral that appears on the right-hand
side of Eq. (42). A simple variable substitution gives B(a, b) = B(b, a). The beta
function is related to the gamma function according to the following formula:

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}. \qquad (43)$$



APPENDIX B


TABULAR SUMMARY OF PARAMETRIC FAMILIES 

OF DISTRIBUTIONS 


1 INTRODUCTION 

The purpose of this appendix is to provide the reader with a convenient reference
to the parametric families of distributions that were introduced in Chap. III. Given
are two tables, one for discrete distributions and the other for continuous distributions.
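Table 1 DISCRETE DISTRIBUTIONS

Entries for each parametric family: discrete density function $f(\cdot)$; parameter space;
mean $\mu = \mathscr{E}[X]$; variance $\sigma^2 = \mathscr{E}[(X - \mu)^2]$; moments $\mu'_r = \mathscr{E}[X^r]$ or
$\mu_r = \mathscr{E}[(X - \mu)^r]$ and/or cumulants $\kappa_r$; moment generating function $\mathscr{E}[e^{tX}]$.

Discrete uniform
    Density: $f(x) = \frac{1}{N} I_{\{1, \ldots, N\}}(x)$
    Parameter space: $N = 1, 2, \ldots$
    Mean: $(N + 1)/2$
    Variance: $(N^2 - 1)/12$
    Moments: $\mu'_3 = N(N + 1)^2/4$; $\mu'_4 = (N + 1)(2N + 1)(3N^2 + 3N - 1)/30$
    Moment generating function: $\sum_{j=1}^{N} \frac{1}{N} e^{jt}$

Bernoulli
    Density: $f(x) = p^x q^{1-x} I_{\{0, 1\}}(x)$
    Parameter space: $0 \le p \le 1$ $(q = 1 - p)$
    Mean: $p$
    Variance: $pq$
    Moments: $\mu'_r = p$ for all r
    Moment generating function: $q + pe^t$

Binomial
    Density: $f(x) = \binom{n}{x} p^x q^{n-x} I_{\{0, 1, \ldots, n\}}(x)$
    Parameter space: $0 \le p \le 1$; $n = 1, 2, 3, \ldots$ $(q = 1 - p)$
    Mean: $np$
    Variance: $npq$
    Moments: $\mu_3 = npq(q - p)$; $\mu_4 = 3n^2p^2q^2 + npq(1 - 6pq)$
    Moment generating function: $(q + pe^t)^n$

Hypergeometric
    Density: $f(x) = \dfrac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}} I_{\{0, 1, \ldots, n\}}(x)$
    Parameter space: $M = 1, 2, \ldots$; $K = 0, 1, \ldots, M$; $n = 1, 2, \ldots, M$
    Mean: $n\dfrac{K}{M}$
    Variance: $n\,\dfrac{K}{M}\,\dfrac{M - K}{M}\,\dfrac{M - n}{M - 1}$
    Moments: $\mathscr{E}[X(X - 1)\cdots(X - r + 1)] = r!\,\dfrac{\binom{K}{r}\binom{n}{r}}{\binom{M}{r}}$
    Moment generating function: not useful

Poisson
    Density: $f(x) = \dfrac{e^{-\lambda}\lambda^x}{x!} I_{\{0, 1, \ldots\}}(x)$
    Parameter space: $\lambda > 0$
    Mean: $\lambda$
    Variance: $\lambda$
    Moments: $\kappa_r = \lambda$ for $r = 1, 2, \ldots$; $\mu_3 = \lambda$; $\mu_4 = \lambda + 3\lambda^2$
    Moment generating function: $\exp[\lambda(e^t - 1)]$

Geometric
    Density: $f(x) = pq^x I_{\{0, 1, \ldots\}}(x)$
    Parameter space: $0 < p \le 1$ $(q = 1 - p)$
    Mean: $q/p$
    Variance: $q/p^2$
    Moments: $\mu_3 = (q + q^2)/p^3$; $\mu_4 = (q + 7q^2 + q^3)/p^4$
    Moment generating function: $\dfrac{p}{1 - qe^t}$

Negative binomial
    Density: $f(x) = \binom{r + x - 1}{x} p^r q^x I_{\{0, 1, \ldots\}}(x)$
    Parameter space: $0 < p \le 1$; $r > 0$ $(q = 1 - p)$
    Mean: $rq/p$
    Variance: $rq/p^2$
    Moments: $\mu_3 = r(q + q^2)/p^3$; $\mu_4 = r[q + (3r + 4)q^2 + q^3]/p^4$
    Moment generating function: $\left(\dfrac{p}{1 - qe^t}\right)^r$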
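Table 2 CONTINUOUS DISTRIBUTIONS

Entries for each parametric family: cumulative distribution function $F(\cdot)$ or probability
density function $f(\cdot)$; parameter space; mean $\mu = \mathscr{E}[X]$; variance $\sigma^2 = \mathscr{E}[(X - \mu)^2]$;
moments $\mu'_r = \mathscr{E}[X^r]$ or $\mu_r = \mathscr{E}[(X - \mu)^r]$ and/or cumulants $\kappa_r$; moment generating
function $\mathscr{E}[e^{tX}]$.

Uniform or rectangular
    Density: $f(x) = \dfrac{1}{b - a} I_{[a, b]}(x)$
    Parameter space: $-\infty < a < b < \infty$
    Mean: $(a + b)/2$
    Variance: $(b - a)^2/12$
    Moments: $\mu_r = 0$ for r odd; $\mu_r = \dfrac{(b - a)^r}{2^r (r + 1)}$ for r even
    Moment generating function: $\dfrac{e^{bt} - e^{at}}{(b - a)t}$

Normal
    Density: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp[-(x - \mu)^2/2\sigma^2]$
    Parameter space: $-\infty < \mu < \infty$; $\sigma > 0$
    Mean: $\mu$
    Variance: $\sigma^2$
    Moments: $\mu_r = 0$, r odd; $\mu_r = \dfrac{r!\,\sigma^r}{(r/2)!\,2^{r/2}}$, r even; $\kappa_r = 0$, $r > 2$
    Moment generating function: $\exp[\mu t + \tfrac{1}{2}\sigma^2 t^2]$

Exponential
    Density: $f(x) = \lambda e^{-\lambda x} I_{[0, \infty)}(x)$
    Parameter space: $\lambda > 0$
    Mean: $1/\lambda$
    Variance: $1/\lambda^2$
    Moments: $\mu'_r = \Gamma(r + 1)/\lambda^r$
    Moment generating function: $\dfrac{\lambda}{\lambda - t}$ for $t < \lambda$

Gamma
    Density: $f(x) = \dfrac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x} I_{[0, \infty)}(x)$
    Parameter space: $\lambda > 0$; $r > 0$
    Mean: $r/\lambda$
    Variance: $r/\lambda^2$
    Moments: $\mu'_j = \dfrac{\Gamma(r + j)}{\lambda^j\,\Gamma(r)}$
    Moment generating function: $\left(\dfrac{\lambda}{\lambda - t}\right)^r$ for $t < \lambda$

Beta
    Density: $f(x) = \dfrac{1}{B(a, b)} x^{a-1}(1 - x)^{b-1} I_{(0, 1)}(x)$
    Parameter space: $a > 0$; $b > 0$
    Mean: $\dfrac{a}{a + b}$
    Variance: $\dfrac{ab}{(a + b + 1)(a + b)^2}$
    Moments: $\mu'_r = \dfrac{B(r + a, b)}{B(a, b)}$
    Moment generating function: not useful

Cauchy
    Density: $f(x) = \dfrac{1}{\pi\beta\{1 + [(x - \alpha)/\beta]^2\}}$
    Parameter space: $-\infty < \alpha < \infty$; $\beta > 0$
    Mean: does not exist
    Variance: does not exist
    Moments: do not exist
    Moment generating function: the characteristic function is $e^{i\alpha t - \beta|t|}$

Lognormal
    Density: $f(x) = \dfrac{1}{x\sqrt{2\pi}\,\sigma} \exp[-(\log_e x - \mu)^2/2\sigma^2]\, I_{(0, \infty)}(x)$
    Parameter space: $-\infty < \mu < \infty$; $\sigma > 0$
    Mean: $\exp[\mu + \tfrac{1}{2}\sigma^2]$
    Variance: $\exp[2\mu + 2\sigma^2] - \exp[2\mu + \sigma^2]$
    Moments: $\mu'_r = \exp[r\mu + \tfrac{1}{2} r^2 \sigma^2]$
    Moment generating function: not useful

Double exponential
    Density: $f(x) = \dfrac{1}{2\beta} \exp\left(-\dfrac{|x - \alpha|}{\beta}\right)$
    Parameter space: $-\infty < \alpha < \infty$; $\beta > 0$
    Mean: $\alpha$
    Variance: $2\beta^2$
    Moments: $\mu_r = 0$ for r odd; $\mu_r = r!\,\beta^r$ for r even
    Moment generating function: $\dfrac{e^{\alpha t}}{1 - (\beta t)^2}$

(continued)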
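Table 2 CONTINUOUS DISTRIBUTIONS (continued)

Weibull
    Density: $f(x) = abx^{b-1} \exp[-ax^b]\, I_{(0, \infty)}(x)$
    Parameter space: $a > 0$; $b > 0$
    Mean: $a^{-1/b}\,\Gamma(1 + b^{-1})$
    Variance: $a^{-2/b}[\Gamma(1 + 2b^{-1}) - \Gamma^2(1 + b^{-1})]$
    Moments: $\mu'_r = a^{-r/b}\,\Gamma(1 + r/b)$

Logistic
    Distribution function: $F(x) = [1 + e^{-(x - \alpha)/\beta}]^{-1}$
    Parameter space: $-\infty < \alpha < \infty$; $\beta > 0$
    Mean: $\alpha$
    Variance: $\beta^2\pi^2/3$
    Moment generating function: $e^{\alpha t}\,\pi\beta t \csc(\pi\beta t)$

Pareto
    Density: $f(x) = \dfrac{\theta x_0^{\theta}}{x^{\theta + 1}} I_{(x_0, \infty)}(x)$
    Parameter space: $x_0 > 0$; $\theta > 0$
    Mean: $\dfrac{\theta x_0}{\theta - 1}$ for $\theta > 1$
    Variance: $\dfrac{\theta x_0^2}{(\theta - 1)^2(\theta - 2)}$ for $\theta > 2$
    Moments: $\mu'_r = \dfrac{\theta x_0^r}{\theta - r}$ for $\theta > r$
    Moment generating function: does not exist

Gumbel or extreme value
    Distribution function: $F(x) = \exp(-e^{-(x - \alpha)/\beta})$
    Parameter space: $-\infty < \alpha < \infty$; $\beta > 0$
    Mean: $\alpha + \beta\gamma$, $\gamma \approx .577216$
    Variance: $\pi^2\beta^2/6$
    Cumulants: $\kappa_r = (-\beta)^r \psi^{(r-1)}(1)$ for $r \ge 2$, where $\psi(\cdot)$ is the digamma function
    Moment generating function: $e^{\alpha t}\,\Gamma(1 - \beta t)$ for $t < 1/\beta$

t distribution
    Density: $f(x) = \dfrac{\Gamma[(k + 1)/2]}{\Gamma(k/2)} \dfrac{1}{\sqrt{k\pi}} \dfrac{1}{(1 + x^2/k)^{(k+1)/2}}$
    Parameter space: $k > 0$
    Mean: $\mu = 0$ for $k > 1$
    Variance: $\dfrac{k}{k - 2}$ for $k > 2$
    Moments: $\mu_r = 0$ for $k > r$ and r odd; $\mu_r = \dfrac{k^{r/2} B((r + 1)/2, (k - r)/2)}{B(\tfrac{1}{2}, k/2)}$ for $k > r$ and r even
    Moment generating function: does not exist

F distribution
    Density: $f(x) = \dfrac{\Gamma[(m + n)/2]}{\Gamma(m/2)\Gamma(n/2)} \left(\dfrac{m}{n}\right)^{m/2} \dfrac{x^{(m-2)/2}}{[1 + (m/n)x]^{(m+n)/2}} I_{(0, \infty)}(x)$
    Parameter space: $m, n = 1, 2, \ldots$
    Mean: $\dfrac{n}{n - 2}$ for $n > 2$
    Variance: $\dfrac{2n^2(m + n - 2)}{m(n - 2)^2(n - 4)}$ for $n > 4$
    Moments: $\mu'_r = \left(\dfrac{n}{m}\right)^r \dfrac{\Gamma(m/2 + r)\,\Gamma(n/2 - r)}{\Gamma(m/2)\,\Gamma(n/2)}$ for $r < n/2$
    Moment generating function: does not exist

Chi-square distribution
    Density: $f(x) = \dfrac{1}{\Gamma(k/2)} \left(\dfrac{1}{2}\right)^{k/2} x^{k/2 - 1} e^{-x/2} I_{(0, \infty)}(x)$
    Parameter space: $k = 1, 2, \ldots$
    Mean: $k$
    Variance: $2k$
    Moments: $\mu'_j = \dfrac{2^j\,\Gamma(k/2 + j)}{\Gamma(k/2)}$
    Moment generating function: $\left(\dfrac{1}{1 - 2t}\right)^{k/2}$ for $t < 1/2$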
TABLES


1 DESCRIPTION OF TABLES

Table 1 Ordinates of the Normal Density Function
This table gives values of

$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

for values of x between 0 and 4 at intervals of .01. For negative values of x one uses
the fact that $\phi(-x) = \phi(x)$.

Table 2 Cumulative Normal Distribution
This table gives values of

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt = \int_{-\infty}^{x} \phi(t)\, dt$$

for values of x between 0 and 3.5 at intervals of .01. For negative values of x, one uses
the relation $\Phi(-x) = 1 - \Phi(x)$. Values of x corresponding to a few special values of $\Phi$
are given separately beneath the main table.


Table 3 Cumulative Chi-Square Distribution

This table gives values of u corresponding to a few selected values of F(u), where

$$F(u) = \int_0^u \frac{x^{(n-2)/2} e^{-x/2}}{2^{n/2}\,\Gamma(n/2)}\, dx$$

for n, the number of degrees of freedom, equal to 1, 2, ..., 30. For larger values of n,
a normal approximation is quite accurate. The quantity $\sqrt{2u} - \sqrt{2n - 1}$ is nearly
normally distributed with mean 0 and unit variance. Thus $u_\alpha$, the $\alpha$th quantile point
of the distribution, may be computed by

$$u_\alpha = \tfrac{1}{2}\left(z_\alpha + \sqrt{2n - 1}\right)^2,$$

where $z_\alpha$ is the $\alpha$th quantile point of the cumulative normal distribution. As an
illustration, we may compute the .95 value of u for n = 30 degrees of freedom:

$$u_{.95} = \tfrac{1}{2}\left(1.645 + \sqrt{59}\right)^2 = 43.5,$$

which is in error by less than 1 percent.
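The arithmetic of this illustration is reproduced in the short Python sketch below. The function name is illustrative, and the comparison value 43.77 is the commonly tabled .95 point of the chi-square distribution with 30 degrees of freedom, quoted here as an assumption rather than computed.

```python
from math import sqrt

def chi2_quantile_approx(z_alpha, n):
    """Normal approximation: u_alpha = (z_alpha + sqrt(2n - 1))^2 / 2."""
    return 0.5 * (z_alpha + sqrt(2 * n - 1)) ** 2

approx = chi2_quantile_approx(1.645, 30)
tabled = 43.77  # assumed tabled value of the .95 point for n = 30
print(round(approx, 1), f"{abs(approx - tabled) / tabled:.4f}")
```

The relative discrepancy is a little over half of one percent, consistent with the statement in the text.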


Table 4 Cumulative F Distribution

This table gives values of F corresponding to five values of

$$G(F) = \int_0^F \frac{\Gamma[(m + n)/2]}{\Gamma(m/2)\,\Gamma(n/2)} \left(\frac{m}{n}\right)^{m/2} x^{(m-2)/2} \left[1 + \frac{m}{n} x\right]^{-(m+n)/2} dx$$

for selected values of m and n; m is the number of degrees of freedom in the numerator
of F, and n is the number of degrees of freedom in the denominator of F. The table
also provides values corresponding to G = .10, .05, .025, .01, and .005 because $F_{1-\alpha}$
for m and n degrees of freedom is the reciprocal of $F_\alpha$ for n and m degrees of freedom.
Thus for G = .05 with three and six degrees of freedom, one finds

$$F_{.05}(3, 6) = \frac{1}{F_{.95}(6, 3)} = \frac{1}{8.94} = .112.$$

One should interpolate on the reciprocals of m and n as in Table 5 for good accuracy.
Table 5 Cumulative Student's t Distribution

This table gives values of t corresponding to a few selected values of

$$F(t) = \int_{-\infty}^{t} \frac{\Gamma[(n + 1)/2]}{\sqrt{\pi n}\,\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx$$

with n = 1, 2, ..., 30, 40, 60, 120, $\infty$. Since the density is symmetrical in t, it follows
that F(-t) = 1 - F(t). One should not interpolate linearly between degrees of
freedom but on the reciprocal of the degrees of freedom, if good accuracy in the last
digit is desired. As an illustration, we shall compute the .975th quantile point for
40 degrees of freedom. The values for 30 and 60 are 2.042 and 2.000. Using the
reciprocals of n, the interpolated value is

$$2.042 - \frac{\tfrac{1}{30} - \tfrac{1}{40}}{\tfrac{1}{30} - \tfrac{1}{60}}\,(2.042 - 2.000) = 2.021,$$

which is the correct value. Interpolating linearly, one would have obtained 2.028.
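The two interpolation schemes compared above can be written out as follows. This Python sketch uses illustrative function names and the two tabled values quoted in the text (2.042 at 30 degrees of freedom and 2.000 at 60).

```python
def interpolate_reciprocal(n, n_lo, v_lo, n_hi, v_hi):
    """Interpolate on 1/n between the tabled points (n_lo, v_lo) and (n_hi, v_hi)."""
    weight = (1 / n_lo - 1 / n) / (1 / n_lo - 1 / n_hi)
    return v_lo - weight * (v_lo - v_hi)

def interpolate_linear(n, n_lo, v_lo, n_hi, v_hi):
    """Ordinary linear interpolation on n itself."""
    weight = (n - n_lo) / (n_hi - n_lo)
    return v_lo - weight * (v_lo - v_hi)

by_reciprocal = interpolate_reciprocal(40, 30, 2.042, 60, 2.000)
by_linear = interpolate_linear(40, 30, 2.042, 60, 2.000)
print(round(by_reciprocal, 3), round(by_linear, 3))
```

On the reciprocal scale, 40 lies exactly halfway between 30 and 60, which is why the weight is one-half there but only one-third on the linear scale.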
Table 1 ORDINATES OF THE NORMAL DENSITY FUNCTION

$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
Table 2 CUMULATIVE NORMAL DISTRIBUTION

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$$

  x     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
  .0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
  .1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
  .2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
  .3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
  .4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
  .5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
  .6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
  .7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
  .8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
  .9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
 1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
 1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
 2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
 3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998

  x             1.282  1.645  1.960  2.326  2.576  3.090  3.291  3.891   4.417
  Φ(x)          .90    .95    .975   .99    .995   .999   .9995  .99995  .999995
  2[1 - Φ(x)]   .20    .10    .05    .02    .01    .002   .001   .0001   .00001
* This table is abridged from “Tables of percentage points of the incomplete beta function and of the chi-square distribution,” Biometrika,
Vol. 32 (1941). It is here published with the kind permission of its author, Catherine M. Thompson, and the editor of Biometrika.



Table 4 CUMULATIVE F DISTRIBUTION* (m degrees of freedom in numerator; n in denominator)

* This table is abridged from “Tables of percentage points of the inverted beta distribution,” Biometrika, Vol. 33 (1943). It is here published 
with the kind permission of its authors, Maxine Merrington and Catherine M. Thompson, and the editor of Biometrika. 
* This table is abridged from the “Statistical Tables” of R. A. Fisher and Frank Yates published
by Oliver & Boyd, Ltd., Edinburgh and London, 1938. It is here published with the kind permission
of the authors and their publishers.
Absolutely continuous, 60, 61, 63, 64
Admissible estimator, 299 
Algebra of sets, 18, 22 
Analysis of variance, 437 
A posteriori probability, 5, 9 
A priori probability, 2-4, 9 
Arithmetic series, 528 

Asymptotic distribution, 196, 256-258,261, 359, 
440, 444 

Average sample size in sequential tests, 470-472 


BAN estimators, 294, 296, 349, 446 
Bayes estimation, 339, 344 
Bayes’ formula, 36 
Bayes risk, 344 
Bayes test, 417 

Bayesian interval estimates, 396, 397 
Bernoulli distribution, 87, 538 
Bernoulli trial, 88 

repeated independent, 89, 101, 137 
Best linear unbiased estimators, 499 
Beta distribution, 115, 540 
of second kind, 215 
Beta function, 534, 535 
Bias, 293 

Binomial coefficient, 529 
Binomial distribution, 87-89, 119, 120, 538 
confidence limits for p, 393, 395 
normal approximation, 120 
Poisson approximation, 119 
Binomial theorem, 530 
Birthday problem, 45 
Bivariate normal distribution, 162-168 
conditional distribution, 167 
marginal distribution, 167 
moment generating function, 164 
moments, 165 
Boole’s inequality, 25 


Cauchy distribution, 117, 207,238, 540 
Cauchy-Schwarz inequalities, 162 
Censored, 104 

Central-limit theorem, 111, 120, 195, 233, 234, 
258 

Centroid, 65 
Chebyshev inequality, 71 
multivariate, 172 

Chi-square distribution, 241, 542, 549 
table of, 553 

Chi-square tests, 440, 442-461 
contingency tables, 452-461 
goodness-of-fit, 442, 447 
Combinations and permutations, 528 
Combinatorial symbol, 528 
Complement of set, 10 


Complete families of densities, 321, 324, 354 
Complete statistic, 324 
Completeness, 321, 354 
Composite hypothesis, 418 
(See also Hypotheses) 

Concentration, 289 
Conditional distributions, 129,148 
bivariate normal, 167 
continuous, 146, 147 
discrete, 143-145 
Conditional expectation, 157 
Conditional mean, 158 
Conditional probability, 32 
Conditional variance, 159 
Confidence bands for c.d.f., 511 
Confidence coefficient, 375, 377, 461 
Confidence intervals, 373, 375, 377, 461 
c.d.f., 511 

difference in means, 386 
general method for, 389 
large sample, 393 

mean of normal population, 375, 381, 384 
median, 512 

method of finding tests, 425, 461 
one-sided, 378 

p of binomial population, 393, 395 
pivotal method of obtaining, 379, 387 
regression coefficients, 491-494 
uniformly most accurate, 464 
variance of normal population, 382, 384 
Confidence limits [see Confidence interval (s)] 
Confidence region, 377 

for mean and variance of normal population, 
384 

Confidence sets, 461 

uniformly most accurate, 464 
Consistency of an estimator, 291, 294, 295, 359 
Contagious distribution, 102,122, 123 
Contingency tables, 452-461 
interaction, 454 
tests for independence, 452 
Continuous distributions, 60, 62 
(See also Distributions) 

Continuous random variable, 60 
Convex function, 72 
Convolution, 186 
Correlation, 155, 161 
sample, 526 

Spearman’s rank, 525, 526 
Correlation coefficient, 155, 156 
Covariance, 155,156 
of two linear combinations of random 
variables, 179 

Covariance matrix, 352, 489 
Cramer-Rao inequality, 316 
Cramer-Rao lower bound, 316, 320 
Critical function, 404 
Critical region, 403
size of, 407 

Cumulant generating function, 80 

Cumulants, 80 

Cumulative distribution function(s), 54, 56, 63, 
130,132, 144 
bivariate, 132 
decomposition of, 63 
empirical, 264, 506 
joint, 130 

sample, 264,287, 506,511 
unidimensional, 54, 56 


Decision theory, 297, 343,415 
Deductive inference, 220 
Degenerate distribution, 258 
Degrees of freedom, 242, 246 
De Moivre-Laplace limit theorem, 120 
De Morgan’s law, 13,14 
Density functions, 62 
discrete, 57,135 
joint discrete, 133 
joint probability, 138 
probability, 60, 62 
(See also Distributions) 

Difference between means: 
confidence intervals, 386 
tests, 432 

Differentiation, 532 

Discrete distributions, 57, 133, 144, 145 
(See also Distributions) 

Discrete random variables, 57, 63, 133 

Disjoint, 14 

Distance function, 287 

Distribution-free tests, 505, 509, 514, 518-524 
Distribution functions, 54, 56, 63, 130, 132 
of difference, 185 
of maximum, 182 
of minimum, 182 
of order statistics, 251 
of product, 187 
of quotient, 187 
of sums, 185 
Distributions: 

asymptotic, 196, 256-264, 359, 440,444 

Bernoulli, 87, 236, 538 

beta, 115, 540 

beta-binomial, 104 

binomial (see Binomial distribution) 

bivariate normal, 162-168 

Cauchy, 117, 207, 238,540 

chi-square, 241, 542, 549 

conditional (see Conditional distributions) 

contagious, 102, 122, 123 

continuous, 60, 62 

cumulative [see Cumulative distribution 
function(s)] 


degenerate, 258 
discrete, 58, 60 
discrete uniform, 86, 538 
double exponential, 117, 540 
exponential, 111, 121,237, 262,540 
F, 246, 247, 542, 549, 554 
gamma, 111, 123, 540
geometric, 99, 538 
Gumbel, 118, 542 
hypergeometric, 91, 538 
joint, 130, 133, 138 
lambda, 128 
Laplace, 117 

limiting, 196, 258, 261, 444, 446, 507-509 
linear function of normal variates, 194 
logarithmic, 105 
logistic, 118, 540 
lognormal, 117, 540 
marginal, 132, 135, 141 
Maxwell, 127 
multinomial, 137 
covariance for, 196 
multivariate, 129-174 
negative binomial, 99, 538 
negative hypergeometric, 213 
normal (see Normal distribution) 
order statistics, 251,254 
Pareto, 118, 542 
Pearsonian system, 118-119 
Poisson, 93, 104,119-121,123, 236, 538 
prior, 340, 417 
r distribution, 127 
Rayleigh, 127 
rectangular, 105, 238, 540 
sample, 224 

Student’s t, 249, 250, 542, 556
symmetric, 170 
table of, 538-543 
truncated, 122 

Tukey’s symmetrical lambda, 128 
uniform, 105,238, 540 
continuous, 105, 540 
discrete, 86, 538 
variance ratio, 246, 437, 438 
Weibull, 117, 542 
(See also Sampling distributions) 


Efficiency, 291 

Ellipsoid of concentration, 353 
Empty set, 10 
Equivalent sets, 10 
Error: 

mean-squared, 291 
size of, 405 
Type I, 405 
Type II, 405 
Estimation:
interval, 372 
point, 271 

Estimator(s), 272, 273 
admissible, 299 
BAN, 294,296 
Bayes method, 286, 339, 344 
best linear unbiased, 499 
better, 299 
closeness, 288 
concentrated, 289 
consistent, 294, 295, 359 
ellipsoid of concentration, 353 
least squares, 286, 498 
location invariant, 334 
maximum likelihood (see Maximum likelihood estimators)
mean-squared error, 291 
method of moments, 274 
minimax, 299, 350 
minimum chi-square, 286,287 
minimum distance, 286,287 
Pitman, 290, 334, 336 
scale invariant, 336 
unbiased, 293, 315 

uniformly minimum-variance unbiased, 315 
Wilks’ generalized variance, 353 
(See also Large samples) 

Event, 14, 15, 18,53 
elementary, 15 
Event space, 15,18, 23 
Excess, coefficient of, 76 
Expectation, 64, 69, 153, 160, 176 
Expected values, 64, 69, 70, 129, 153, 160 
conditional, 157 

of functions of random variables, 176 
properties of, 70 

Exponential class, 312, 313, 320, 326, 355, 422 
Exponential distribution, 111,121,237,262, 540 
Exponential family, 312, 313, 320, 326, 355, 422 
Extension theorem, 22 
Extreme-value statistic, 118,258 
asymptotic distribution of, 261 


F distribution, 246, 542, 549 
table of, 554 

Factorial moment generating function, 79 
Factorial moments, 77, 79 
Factorial notations, 528 
Factorial symbol, 528 
Factorization criterion, 307 
Finite population, sampling from, 267 
Frequency function, 58 
Function, 19 
beta, 535 
convex, 72 

counterdomain of, 19, 53 


decision, 297 
definition of, 19 
density (see Distributions) 
distance, 287 

distribution (see Distributions) 

domain of, 19, 53 

gamma, 534 

generating, 84 

image of, 19 

indicator, 20 

likelihood, 278 

loss, 297 

squared-error, 297 

moment generating (see Moment generating 
function) 
power, 406 
preimage, 19 
probability, 21,22,26 
regression, 158, 168 
risk, 297,298 
set, 20, 21 
size-of-set, 21 


Game of craps, 48 

Gamma distribution, 111, 112, 123,540 
Gamma function, 534 
Gauss-Markov theorem, 500 
Gaussian distribution (see Normal distribution) 
Generalized likelihood ratio (see Likelihood 
ratio) 

Generalized variance, 352, 353 
Generating functions (see specific generating 
functions) 

Geometric distribution, 99, 538 
Geometric series, 528 
Glivenko-Cantelli theorem, 507 
Goodness-of-fit test: 
chi-square, 442, 447 
Kolmogorov-Smirnov, 508, 509 
Gumbel distribution, 118, 542 


Homogeneity of populations, test of, 505 
two exponentials, 476 
two multinomials, 450 
two normals, 432, 435 
two Poissons, 451 
two trinomials, 479 

Homogeneity of variances, test of, 438, 439 
Hypergeometric distributions, 91,538 
Hypotheses, statistical: alternative, 405 
composite, 402 
null, 405 
simple, 402, 409 
(See also Tests of hypotheses) 
Ideal power function, 406
Incomplete beta function, 115,513, 516 
Incomplete gamma function, 114 
Independence, 32,150,160 
in contingency tables, 452 
of events, 40, 41, 46 

in probability sense, 32,143, 150, 160, 161 
of random variables, 150 
of sample mean and variance, 243,245 
stochastic, 150 
Index set, 13 
Inference, 220, 271 
deductive, 220 
inductive, 220,221 
Information, 300, 301 
Interquartile range, 75 
Intersection, 10,13 
Interval estimation, 372 
Bayesian, 396 
large sample, 393 
(See also Confidence intervals) 

Invariance: 
location, 331-336 

of maximum likelihood estimators, 284, 285, 
442 

scale, 331, 336-338 
Inverse binomial sampling, 103 


Jacobian, 205 
Jensen inequality, 72 
Joint distributions, 129ff. 
Joint moments, 159 


Kolmogorov-Smirnov goodness-of-fit test, 
508-510 
Kurtosis, 76 


Lagrange multipliers, 501 
Large sample, 294, 358 
confidence limits, 393 
distribution of estimators, 359 
distribution of generalized likelihood ratio, 
440 

distribution of mean, 233 
Law of large numbers, 231, 232,258 
Least squares, 498 

Lehmann-Scheffe theorem, 326, 356 
Leibniz’ rule, 532 
l’Hospital’s rule, 532
Liars problem, 47 
Likelihood function, 278 
induced, 285 
Likelihood ratio: 

exact distribution of, 480 


generalized, 419 

large-sample distribution for, 440 
monotone, 423, 424 
simple, 410, 423 
tests, 409, 419, 440 

Limiting distribution, 196, 258, 261,444, 446, 
507-509 

Linear function of normal variates, distribution 
of, 194 

Linear models, 482, 485 
confidence intervals, 491 
point estimation, 487, 498 
tests of hypotheses, 494 
Linear regression, 482 
Location invariance, 332-336 
Location parameter, 333 
Logistic, 118, 540 
Lognormal distribution, 117, 540 
Loss function, 297, 343, 414, 415 


Marginal distributions: 
for bivariate normal distribution, 167 
continuous, 141 
discrete, 132, 135 
Mass, 58 
Mass point, 58 

Maximum of random variables, 182 
Maximum likelihood, principle of, 276 
Maximum-likelihood estimators, 279 
invariance property, 284, 285 
large-sample distribution of, 358, 359 
of parameters: of normal distribution, 281 
of uniform distribution, 282 
properties of, 284, 358 
Mean: 

definition of, 64 
distribution of, 236-238 
sample, 228, 230 
variance of, 231 
Mean absolute deviation, 297 
Mean-squared error, 291, 297 
Mean-squared-error consistency, 294, 295 
Median, 73, 255 
sample, 255 
tests of, 521 

Mendelian inheritance, 445 
Method of moments, 274 
Midrange, 255 

Minimal sufficient statistic, 311 

Minimax estimator, 299, 350 

Minimax test, 416 

Minimum of random variables, 182 

Minimum chi-square estimation, 286, 287 

Minimum distance estimation, 286, 287 
Minimum-variance unbiased estimator, 
uniformly (UMVUE), 315 
Mixture, 122, 123 
Mode, 74 
Model: 

functional, 483 
linear, 483, 485 

Moment generating function, 72, 78, 80, 159-161, 164 
factorial, 79 

of random variables, 159, 160 
table of, 538-543 
Moment problem, 81 
Moments, 64, 72, 73 
central, 73 
cumulant, 80 
estimators of, 227 
factorial, 77 
joint, 159 
population, 227 
problem of, 81 
raw, 72, 73 
sample, 227 

Monotone likelihood ratio, 423, 424 
Multinomial distribution, 137, 253, 443 
covariance of, 196 
tests on, 448 

Multinomial theorem, 530, 531 
Multiplication rule, 37 
Multivariate distributions, 129ff. 

marginal and conditional distributions for, 
132, 135, 141, 143-147 
moment generating function for, 160 
moments of, 159 
Mutually exclusive, 3, 14, 41 


Negative binomial distribution, 99, 438 
Negative hypergeometric distribution, 213 
Neyman-Pearson lemma, 411 
Nonparametric methods, 504 
confidence intervals, 512 
equality of distributions, 518-524 
interval estimates, 512 
Kolmogorov-Smirnov statistic, 508-510 
median, 512 
median test, 521 
point estimation, 512 
quantiles, 512 
rank correlation, 525 
rank-sum test, 522 
run test, 519 

sign test: one-sample, 514 
two-sample, 519 
tests (see Tests of hypotheses) 
tolerance limits, 515 


Normal distribution, 107-111, 120, 239, 540, 548 
bivariate, 162, 525 
conditional, 167 

independence of sample mean and variance, 
243, 245 
marginal, 167 

moment generating function for, 164 
multivariate, 162 
regression functions for, 168 
role of, 239 
sample mean, 240 
sample variance, 241 
table of, 552 
truncated, 124 
Normal equations, 487 
Null hypothesis (see Hypotheses, statistical) 


Order statistics, 251 

asymptotic distribution of, 256-264 
distribution of functions of, 254 


Parameter, 85 
Parameter space, 273, 351 
Parametric family, 85 
Pareto distribution, 118, 542 
Partition, 300 

Pearson’s chi-square tests, 444, 459, 461 
Pitman-closer, 290 
Pitman estimator for location, 334 
Pitman estimator for scale, 337 
Pivotal quantity, 379 
Pivotal quantity method, 379, 387 
Poisson distribution, 93, 104, 119-121, 123, 236, 
538 

compound, 123 
Populations, 222, 224 
sampled, 223 
target, 222 

Posterior Bayes estimator, 341 
Posterior distribution, 340 
Posterior risk, 346 
Power function of test, 406, 411 
ideal, 406 

Prior distribution, 340, 417 
Probability, 2 
a posteriori, 5, 9 
a priori, 2-4, 9 
axioms of, 8, 22 
classical, 2, 3, 5 
conditional, 32, 42 
properties of, 34 
definition, 3, 21, 22 
equally likely, 3, 5, 25 
frequency, 2, 5, 6 
function, 21, 22 
independence, 32, 40-42 
integral transformation, 202 
laws of, 23-25, 34-37 
mass function, 58 
models, 8 

properties of, 23-25, 34-37 
space, 25, 53 
subjective, 9 
total, 35 

Probability density function, 57, 60, 62, 141 
Probability function, 58 
equally likely, 26 

Probability generating function, 84 
Probability integral transform, 202 
Problem of moments, 81 
Product of two random variables, 180 
Product notation, 527 
Propagation of errors, 181 


Quantile, 73 

point and interval estimates, 512 
tests of hypotheses, 514 
Quotient of two random variables, 180 


Random number, 107, 202 
of random variables, 197 
Random sampling, 223 
Random variables, 53 
continuous, 60, 138 
definition of, 53 
discrete, 57, 133 
joint, 133, 138 
maximum of, 182 
minimum of, 182 
mixed, 62 
product, 180 
quotient, 180 
sums, 178 

(See also Distributions) 

Range, 255 
of function, 19 
interquartile, 75 
of sample, 255 
Rank-sum test, 522 
Rao-Blackwell theorem, 321, 354 
Rectangular distribution, 105, 238, 540 
References, 544 
Regression, linear, 482 
Regression coefficient: 
confidence interval, 491 
estimators, 491 
tests, 494 

Regression curve, 158 
for normal, 168 


Regression function (see Regression curve) 
Relative frequency, 3, 6, 8 
Reparameterize, 441, 442 
Risk: 

Bayes, 344 

function, 297, 298, 415 
posterior, 346 
Runs, 519 


Sample, 222 

cumulative distribution function, 264 
distribution of, 224 
mean, 227, 228, 230 
median, 255 
midrange, 255 
moments, 226, 227 
quantiles, 251 
random, 223 
range, 255 
variance, 229, 245 
Sample c.d.f., 264 
inferences on, 506 
Sample mean, 219, 227, 228, 230, 240 
variance of, 231 
Sample moments, 219, 227 
Sample point, 9 
Sample space, 9, 14 
finite, 15, 31, 25ff. 

Sample variance, 229, 245 
Sampled populations, 223 
Sampling, 27, 219 
with replacement, 27 
without replacement, 27 
Sampling distributions, 219, 224 
for difference of two means, 386 
for mean: from binomial population, 236 
from Cauchy population, 238 
from exponential, 237 
of large samples, 232-235 
from normal, 241 
from Poisson population, 236 
from uniform population, 238 
for order statistics, 251ff. 
for ratio of sample variances, 247 
for regression coefficients, 490 


Scale invariance, 336-338 
Scale parameter, 336 
Semiinvariants, 80 

Sequential probability ratio test, 464, 466-467 
approximate, 468 
expected sample size of, 468, 470 
Sequential tests, 464 
for binomial, 481 
fundamental identity for, 470 
for mean of normal population, 471 
sample size in, 468 
Set, 9 

complements, 10 
confidence, 461 
difference, 10 
disjoint, 14 
empty, 10 
equivalence of, 10 
function, 21 
index, 13 

intersection, 10, 13 
laws of, 11 

mutually exclusive, 14 
null, 10 
theory, 9ff. 
union, 10, 13 
Sigma algebra, 22 
Significance level, 407 
Significance testing, 407, 409 
Simple hypothesis (see Hypotheses, statistical) 
Singular continuous c.d.f., 63 
Size of critical region, 407 
Size of set, 29 
Size of test, 407 
Skewness, 75, 76 
Space, 9 

event, 1, 14, 15, 18, 23, 53 
parameter, 273 
probability, 25 
sample, 1, 14, 31 

Spearman’s rank correlation, 525, 526 
Standard deviation, 68 
Statistic, 226 

Statistical hypotheses (see Hypotheses, 
statistical) 

Statistical inference, 271 
Statistical tests (see Tests of hypotheses) 
Stieltjes integral, 69 
Stirling’s formula, 530 
Stochastic independence, 150 
Student’s t distribution, 249, 250, 542, 550 
table of, 556 
Subset, 10 

Sufficient statistics, 299, 301, 306, 321, 391 
complete, 321, 324, 326 
factorization criterion, 307 
jointly, 306, 307 
minimal, 311, 312, 326 
tests of hypotheses, 408 
Summation notation, 527 
Sums of random variables: 
covariance of, 179 
distribution, 192 
variance of, 178 
Symmetrically distributed, 170 


t distribution (see Student’s t distribution) 
Tables of distributions, 538-543 
chi-square, 553 
F, 554, 555 
normal, 551, 552 
Student’s t, 556 
Target populations, 222 
Taylor series, 533 
Tests of hypotheses, 271, 401-403 
Bayes, 417, 418 
chi-square, 440 
composite, 418 
and confidence intervals, 461 
critical function of, 404 
critical region of, 403 
distribution-free (see Nonparametric 
methods) 

equality-of-means, 432, 435 
equality of two distributions, 518 
equality of two multinomials, 448 
goodness-of-fit, 442, 447 
homogeneity (see Homogeneity of 
populations) 

homogeneity of variances, 438, 439 
independence in contingency tables, 452 
large-sample, 440 
likelihood-ratio: generalized, 419 
simple, 410, 419 

mean of normal population, 428-431 

median, 521 

minimax, 416 

most powerful, 410, 411 

nonrandomized, 403 

null hypothesis, 405 

power of, 406 

randomized, 403, 404 

rank-sum, 522 

ratio of variances, 438 

relation to confidence intervals, 461 

run, 519 

sequential (see Sequential tests) 
of significance, 407 
simple, 409 
size of, 407 

sufficient statistics, 408 
unbiased, 425 

uniformly most powerful, 421 
on variances, 431, 432, 438 
Tests of significance, 407 
Ticktacktoe problem, 62 
Tolerance limits, 505, 515, 516 
Total probabilities, theorem of, 35, 148, 149 
Transformations, 175, 198, 202 
c.d.f. technique, 181 
m.g.f. technique, 189 
probability integral, 202, 203 
Treatment effect, 437, 519 





Truncated distribution, 103, 104, 122, 123 
normal, 124 
Poisson, 104 
Type I and II errors, 405 
size of, 405 


UMVUE (uniformly minimum-variance 
unbiased estimator), 315 
Unbiased estimator(s), 293, 315, 352 
best linear, 499 
joint, 352 

uniformly minimum variance, 315 
Unbiased test, 425 

Uncorrelated random variables, 161, 173, 174 
Uniform distribution, 105, 238, 540 
Uniformly most accurate, 464 
Uniformly most powerful test, 421 
composite, 421 
simple, 411 
unbiased, 425 
Union of sets, 10, 13 


Variance, 67 
analysis of, 437 
conditional, 159 
definition of, 67 
distribution of sample, 241 
estimator of, 229 

of linear combination of random variables, 
178, 179 

lower bound for, 315 
sample, 229, 245 
of sample mean, 231 
of sum of random variables, 178 
tests of, 431, 438 

Variance-covariance matrix, 352, 489 
Variance ratio, 246, 437 
Vector parameters, 351 
Venn diagrams, 11-13 


Waiting time, 101, 103 
Wald’s equation, 470 
Weibull distribution, 117, 542 
Wilks’ generalized variance, 353 



